Making My Own Anime Database (part 1)

WHY?

Simply put, I wanted to build a recursive web scraper/crawler, and an up-to-date anime database parsed into JSON was lacking on GitHub. So I'm making one! What exactly are the steps to make your own anime database?

First off, you can't be doing manual data entry. You need a web crawler. I'm targeting MyAnimeList. Not in any bad sense, I love the site. o.o

MAL has its own API, but it's terrible. You cannot retrieve anime info without third-party APIs and wrappers. I've made Stats::Extract, which extracts data from an HTML file, so this shouldn't be too advanced for me.

THE STEPS

  1. Make the Crawler (done!)
  2. Make the Wrapper (working on it!)
  3. Make the Scraper (not even a single line)

In this post, I'll focus on

MAKING THE CRAWLER

The crawler is a script that requires an entry point, a link if you will, to a web page; from there it searches for whatever you're looking for. In my case, I'm targeting anime (I'll do manga too).

The entry point is: https://myanimelist.net

What I’m looking for: https://myanimelist.net/anime/{anime id}

So, after crawling the entry point, it looks for anime page links and adds them to the "queue pool". But it doesn't end there. It does its job as a crawler and iterates through the queue pool, loading each and every page and adding the links extracted from those pages back into the pool. Now, this is a long process.
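
Roughly, the queue-pool loop looks like this minimal sketch. It's an illustration of the approach, not the actual crawler code; the regex and the plain file_get_contents fetching are simplifications I've assumed here.

    <?php
    // Minimal sketch of the queue-pool idea (illustration only, not the real crawler).
    $entryPoint = 'https://myanimelist.net';
    $queue      = [$entryPoint];   // the "queue pool"
    $visited    = [];
    $animeLinks = [];

    while (!empty($queue)) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;               // already crawled this page
        }
        $visited[$url] = true;

        $html = @file_get_contents($url);
        if ($html === false) {
            continue;               // skip pages that fail to load
        }

        // Look for anime page links of the form https://myanimelist.net/anime/{anime id}
        if (preg_match_all('#https://myanimelist\.net/anime/\d+#', $html, $matches)) {
            foreach (array_unique($matches[0]) as $link) {
                if (!isset($visited[$link])) {
                    $animeLinks[$link] = true;   // remember it for the scraper
                    $queue[]           = $link;  // and keep crawling from it
                }
            }
        }
    }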

If you understood what I'm having it do, you might ask why in the world I don't extract the anime info with a wrapper while I'm already on its web page?! Well, you see, by the time I was done I realized the process was painfully slow. I've started researching multithreading/forking in PHP so I can use that in the Scraper instead. Furthermore, I only let the crawler go through around 2,000 anime listings until I got tired of it. It proved my point: it was working. I could still use it for anime that get newly added to the database or something.
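
For the curious, the forking approach I've been reading about looks roughly like this. It's a sketch only: it needs the pcntl extension and the CLI, and the input file, worker count and chunking are all assumptions of mine.

    <?php
    // Sketch of the forking idea: split the link pool across a few worker processes.
    // Needs the pcntl extension and the CLI; the input file and worker count are made up.
    $links   = file('anime_links.txt', FILE_IGNORE_NEW_LINES) ?: [];
    $workers = 4;
    $chunks  = array_chunk($links, max(1, (int) ceil(count($links) / $workers)));
    $pids    = [];

    foreach ($chunks as $i => $chunk) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            die("Could not fork worker $i\n");
        }
        if ($pid === 0) {
            // Child process: fetch its share of the pages, then exit.
            foreach ($chunk as $url) {
                $html = @file_get_contents($url);
                // ... hand the HTML off to the wrapper/scraper here ...
            }
            exit(0);
        }
        $pids[] = $pid;                 // parent keeps track of its children
    }

    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status);   // wait for every worker to finish
    }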

I got the rest of the anime links from the MAL users who had the most watched anime entries on their lists.

The crawler is completely CLI (command line interface). The Wrapper will be a PHP Library and the Scraper will be CLI too.

I'll release the source code on GitHub when it's in a presentable state (soon).

THE PLAN

  1. Make a basic wrapper which fetches anime information (such as name, episodes, studios, producers, ratings, air date, genres, etc.). This would be a simple wrapper for the database, which doesn't need all the information stored on MAL anime pages.
  2. Make a scraper with multithreading/forking that uses the pool of MAL anime links I have right now to fetch their data and build my database.
  3. Re-write the wrapper as a complete NO-AUTHENTICATION API to fetch each and everything about anime, manga, people, characters, etc. Basically a complete wrapper for the whole site. And release it on GitHub, because MAL's own API is lackluster.
  4. Re-write the scraper with the crawler and the wrapper as its main components. This time the scraper will asynchronously add anime links to the pool and extract the anime information from those pages directly. This could probably be the ultimate MAL scraper.

That is, if I get it done.

Oh, and a sneak peek at the wrapper.

[Screenshot: wrapper preview]
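
For a rough idea of how the wrapper goes about things, here's a minimal sketch using DOMDocument and XPath. The class name, the XPath query and the returned fields are placeholders of mine, not the actual wrapper code.

    <?php
    // Illustrative sketch only; the class name, XPath query and fields are placeholders.
    class AnimeWrapperSketch
    {
        public function fetch($animeId)
        {
            $html = @file_get_contents('https://myanimelist.net/anime/' . (int) $animeId);
            if ($html === false) {
                return null;
            }

            $doc = new DOMDocument();
            @$doc->loadHTML($html);              // quiet the warnings from messy markup
            $xpath = new DOMXPath($doc);

            // Example: the <title> tag usually contains the anime's name.
            $titleNode = $xpath->query('//title')->item(0);

            return [
                'id'    => (int) $animeId,
                'title' => $titleNode ? trim($titleNode->textContent) : null,
            ];
        }
    }

    // $wrapper = new AnimeWrapperSketch();
    // print_r($wrapper->fetch(1));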


Part 2: https://irfandahir.wordpress.com/2017/05/13/making-my-own-anime-database-part-2/


Project.Extract Cloud (Alpha) is live

There have been delays, but it's here. The alpha version of the CS2D log data extractor, Project.Extract Cloud, is up and running. There are still some things left to do, which I'll explain in a second.

Other than that, you can only extract one file at a time; I might as well keep that as the limit. I'm grateful to be hosted for free by BroHosting as a test of their hosting services, and so far there have been absolutely no critical problems.


[Screenshot: checkout.png]

WHAT YOU SHOULD CHECK OUT!

That would be the server statistics functionality; the core of the application lies there. Feel free to drop in whatever log you like and get as much information out of it as possible!


TODO!

Text Searching

The text searching page is the bare minimum right now; it's maybe 5% done. It'll end up as polished and organized as the 'server statistics' page.

User Database

The user database will be an offline-only feature of PE4. It'll automatically store player information in a database for you to access easily.

Server Statistics Polishing

As complete as it looks, it's still a fair way from done. First off, the map graph you see is a complete dummy; it's not implemented at all. Secondly, there's some design polishing I need to do. Apart from that, I want to see if I can fit more data and graphs in there.

Usage Statistics

You've probably noticed a blank space in the black bar at the top after you click it. What's meant to go there is a graph of your usage statistics for the browser app. The core functionality of this is complete, but I'm planning to add the graphs and such at the end.


PRIVACY

Some of you might be wondering about the log files you're uploading to the server. I'll let you know beforehand that these log files are stored. The reason is that they're cached in case you reload the page. A JSON version of the extracted contents is stored as well.

When I release the beta, what I said above will still apply to the offline version of Project.Extract, but the cloud version won't store anything. Nothing will be cached.


That’s it for now, until the beta phase.

The Official Follow Up

So, I'm back with another blog. I realise the previous one (bootyphpandi.wordpress.com) was lacking a decent name, so I took it upon myself (again) to bring things to a professional state. I had plans to build my own CMS, but I realised I'd have to pick up AngularJS and some more alpha-type stuff to make it look like a decent CMS. That, plus the lack of time, means I'll be using this as my official blog.

Shoutout to Tonal theme as I really love this minimalistic freebie.

Some Updates

Portfolio Polishing

I updated my own portfolio (irfandahir.com). The design was left unfinished, so I polished it up a bit after receiving some insight and critique from forum boards. I still feel it's lacking, so I'm devising plans to make it look nicer.


[Image: Omilos by id]

Introducing Omilos

I've made another freebie, Omilos. I thought "Omilos" was Greek for "something big", but my Greek buddy corrected me once more. Nevertheless, the design is up to the level of being usable. IMO it's lacking some design fundamentals and has some flaws, but it will get the job done, as it's coded as cleanly as possible. If you hire a designer or have some coding skills, you'll be able to adjust it to your needs.

Demo | Download


[Image: Project.Extract Cloud]

Project.Extract 4

If you're a CS2D player then you might know what this is; if not, here's a brief explanation. Project.Extract (including legacy versions 1, 2 & 3) has been a downloadable app which runs through your browser with a dependency on Apache & PHP (WAMP, LAMP). CS2D generates a fair amount of log files, so I created a PHP library which extracts useful information from these logs, and Project.Extract is the visual version of that. The legacy versions 1-3 only extracted user information and had text-based searching.

So a year later, after leveling up multiple times in PHP, I realised I could extract so much more. I've developed a PHP library, Log Miner, which acts as the core for Project.Extract 4. Both the library and PE4 are in the works. The difference between the legacy versions and this one is that it can extract A LOT more from your logs. Every single detail. And the awesome part? It's both a web-based app, if you don't know how to set it up yourself, and a downloadable one, which removes the limits. I'll talk more about it once I'm ready to deploy it.

If you're interested, these are the repos you should keep an eye on.

[REPO][PHP Library] Log Miner

[REPO] Project.Extract 4


That’s it for now.

The Path To A CMS (Part 3)

Alright! So I managed to complete the article module as well as the theme module! The theme module will be able to load any theme, given that it's properly configured as I described in my previous post. Other than that, I have made a default template for it which is even responsive (yay!). I'll be finishing the template along with the rest of the CMS. All that's left is the design, the admin panel and the articles page. The articles page shouldn't take long; I'll need to implement a URL parser for that, which reminds me: I managed to implement pretty URLs for the pages after a thirty-minute struggle over how to fix the broken style sheets and index links.
[Screenshot: cms1]
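
As for the URL parser I mentioned, it would boil down to something like this minimal sketch. The /article/{slug} route format and the rewrite-to-index.php setup are assumptions for illustration.

    <?php
    // Minimal sketch of the URL parsing idea; the /article/{slug} route is an assumption.
    // Assumes .htaccess rewrites every pretty URL to index.php.
    $path     = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);   // e.g. "/article/my-first-post"
    $segments = array_values(array_filter(explode('/', (string) $path)));

    if (isset($segments[0], $segments[1]) && $segments[0] === 'article') {
        $slug = $segments[1];
        // ... hand $slug to the articles class to load that single article ...
    } else {
        // ... fall back to the index page (latest articles + pagination) ...
    }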

Unlike my prototype articles class, this one loads directly from a database using mysqli.
[Screenshot: cms2]
It stores the articles that are to be shown on the page in a public array, which can be read in the index page after instantiating the Articles class. Doing it this way means you can use it from any theme you add, wherever you like.
[Screenshot: cms3]
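
As a rough sketch of the same idea (the screenshots show the real code; the table and column names here are assumptions of mine):

    <?php
    // Rough sketch of the idea, not the actual class; table and column names are assumptions.
    class Articles
    {
        public $articles = [];          // public array the index page / theme reads from

        public function __construct(mysqli $db, $limit = 5)
        {
            $stmt = $db->prepare(
                'SELECT id, title, body, created_at FROM articles ORDER BY created_at DESC LIMIT ?'
            );
            $stmt->bind_param('i', $limit);
            $stmt->execute();
            $result = $stmt->get_result();

            while ($row = $result->fetch_assoc()) {
                $this->articles[] = $row;
            }
        }
    }

    // $db       = new mysqli('localhost', 'user', 'pass', 'blog');
    // $articles = new Articles($db);
    // foreach ($articles->articles as $article) { echo $article['title']; }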

Your CMS's design looks quite similar to the one you're using on WordPress.
hahaha, what? Nonsense. >.>

The Path To A CMS (part 2)

So far I had no clue how to manage themes, but after a little pondering I managed to come up with a solution. I was too lazy to google it anyway, and I believe this solution fits me best because I created it. No, I do not care whether it is an established method; I manifested it within my own ideologies and therefore it is mein.

Anyhow, this is pretty much how it will manage the templates: it will load themes as theme1, theme2, theme3 and so on using my swagging configuration parser. Then, using each key's value, I'll use explode() with '.' as the delimiter to pull out the individual fields.

[Screenshot: themes class]


[Screenshot: themes config]

This is a prototype of what the themes 'database' would look like. If you're wondering what the '.pb' extension is, that's the CMS's own configuration extension, short for 'project blog'.
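
To make the explode() idea concrete, here's a minimal sketch. The real themes.pb layout is whatever the screenshot above shows; the dot-separated field order (name.directory.author) here is purely an assumption.

    <?php
    // Sketch of the idea only: the field order name.directory.author is an assumption.
    $themesConfig = [
        'theme1' => 'Default.default-theme.Irfan',
        'theme2' => 'Dark.dark-theme.Irfan',
    ];

    $themes = [];
    foreach ($themesConfig as $key => $value) {
        // Split the dot-delimited value into its parts, as described above.
        list($name, $directory, $author) = explode('.', $value);
        $themes[$key] = [
            'name'      => $name,
            'directory' => $directory,
            'author'    => $author,
        ];
    }

    print_r($themes);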

So yeah, I suppose this concludes the theory behind the theme management module. I'll cover what comes next in upcoming posts. It might be another module, as I lack the ability to finish one module at a time and usually end up doing bits and pieces of everything all together. Oh well.

The Path To A CMS (part 1)

And so, while I took the last sip of my everlasting (somewhat) dew, an idea struck: why not go all the way? (No innuendo intended.) Dedicate some time to develop a functioning CMS. Time is something I probably have right now, so why not? It would not be a complex CMS like WordPress, Joomla or whatever, but a minor one that would classify as my first. (Again, no innuendo intended.)

I started off with a directory layout of what I currently need. At the root of the directory I have index, admin, article & an error page, along with two more directories: interface & core. The interface directory will deal with themes, and the core directory will, obviously, deal with the main thing itself.

The index page will show the top 5 articles using the pagination system I developed a few days ago (see the sketch below).
The article page will handle per-article display; whatever article the user chooses to visit will be displayed there.
The error page will handle error redirects, mostly from the .htaccess file.
And lastly, the admin page. I'll be developing this at the end, however. This page, you guessed it, will deal with administration stuff.
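
For reference, the pagination mentioned above boils down to something like this sketch. It assumes a mysqli connection in $db and an articles table, which are placeholder names, not the actual code.

    <?php
    // Sketch of the pagination idea: 5 articles per page via LIMIT/OFFSET.
    // Assumes a mysqli connection in $db and an `articles` table; not the actual code.
    $perPage = 5;
    $page    = isset($_GET['page']) ? max(1, (int) $_GET['page']) : 1;
    $offset  = ($page - 1) * $perPage;

    $stmt = $db->prepare('SELECT id, title FROM articles ORDER BY created_at DESC LIMIT ? OFFSET ?');
    $stmt->bind_param('ii', $perPage, $offset);
    $stmt->execute();
    $articles = $stmt->get_result()->fetch_all(MYSQLI_ASSOC);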

The core will contain these classes:

  • class.articles.php
  • class.database.php
  • class.theme.php
  • class.parser.php
  • class.admin.php

As you probably guessed, multiple themes are something I'll be implementing. I developed a prototype of this a while back, and it's what my portfolio currently uses, although I haven't really implemented any templates other than the current one yet.
The parser class will deal with whatever parsers I may implement. One that I'm definitely implementing is my swagging configuration parser, which will handle template source links, titles and whatnot. I don't yet know if I should allow database credentials to be stored in there; it may be risky, a possible security hole. But let's see what to do.

I have the whole night plus leftover pizza, so let the project begin.

The Config Parser With Swag

[Screenshot: cfgparser2]

So this is another minor project I worked on. Playing around with strings of text turns out to be quite fun, especially when you get it right the first few times. So, what's the use of this crap? Simply put, configuration files in browser-based applications are quite useful and more user-oriented. I made this solely for use in a project I've been working on called 'Project.Extract'; that'll be something to post about later on.

The parser consists of two public functions.

public unpack()

public pack()

I should get a code-display plugin for WordPress. Anyhow, these two functions, as you can see, can be used to parse any configuration file you've got in mind. The file has to have one of the valid extensions listed in the $config_preset_header_info array. And of course, when parsing is done, it loads the values into the $config array.

[Screenshot: cfgparser1]

Other than that, it provides accurate variable types. You won't get cases where 'age' ends up being treated as a string. And you know how annoying it is when your 'true' values become '1' in the parsed file? Well, no more!
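
The idea behind the type handling is roughly this. It's a sketch of the concept, not the library's actual internals, and the project.pb file name is just an example.

    <?php
    // Sketch of the type-preserving idea, not the library's actual internals.
    function castConfigValue($raw)
    {
        $raw = trim($raw);
        if (strcasecmp($raw, 'true') === 0)  { return true;  }   // stays a real boolean, not 1
        if (strcasecmp($raw, 'false') === 0) { return false; }
        if (is_numeric($raw)) {
            return $raw + 0;                                      // int or float, not a string
        }
        return $raw;                                              // plain string
    }

    $config = [];
    foreach (file('project.pb', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        if (strpos($line, '=') === false) {
            continue;                                             // skip anything that isn't key=value
        }
        list($key, $value) = explode('=', $line, 2);
        $config[trim($key)] = castConfigValue($value);
    }

    // $config['age'] comes out as an int, $config['enabled'] as a real boolean, and so on.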

So in conclusion, this minor project got me comfortable with parsing files and whatnot, and will most likely be applicable in future projects. I'm still debating whether I should add a nesting option for sub-configuration values. Something like this:

[main]
blah=blah
somethingelse=false

[side]
dis=dat

You get the idea. (I seriously need to get a code-management display plugin…)
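
For what it's worth, unpacking that nested example would presumably give something like:

    <?php
    // What the nested option could presumably unpack to (just an illustration):
    $config = [
        'main' => [
            'blah'          => 'blah',
            'somethingelse' => false,
        ],
        'side' => [
            'dis' => 'dat',
        ],
    ];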

You can view the source and/or download it here.