How I Dumped My Entire Careem Rides Data


I use Careem. A lot. Most of that usage was up until earlier this year, booking rides to and from college. It's one of the most affordable modes of transport in Pakistan right now.

For those who don't know what Careem is, it's a ride-hailing service like Uber.

As a fan of data, I really wanted to know how much I've spent, how much distance I've covered, and so on over the period I've used Careem. Unfortunately, Careem doesn't offer a public API. There is something of the sort, but you have to contact them for access, and I haven't gotten a reply yet. 😦

Having built a successful REST API service before, I decided to reverse engineer their web dashboard for users instead. This dashboard lets you book rides online and shows you your ride history.

[Screenshot: the Careem web dashboard]

My initial thought was web scraping, but that would require a scraper that authenticated through their web form, bypassed their CSRF protection, and got past that hidden Google reCAPTCHA. If I could do all that by passing around headers and sessions, it would've allowed me to develop an unofficial API, but ain't nobody got time fo' dat. What I really wanted was my rides data.

One way to do it is to go to the rides page and furiously click "Load More" until you have a decent amount, then "select all" and click "Export as CSV".

[Screenshot: the rides page with the "Load More" and "Export as CSV" controls]

This method wasn't exactly efficient; I needed something much faster. And I found something really interesting that returned a lot more data than they'd want to show on the frontend.

Careem's frontend, like any other, gets its data from their server via GET or POST requests. I popped open the handy Chrome DevTools and monitored the network for any requests being made as I clicked the "Show More Rides" button, since it updates the DOM (adds more data to the page without reloading).

And this brought my attention to the following request:

[Screenshot: the getAllAccessibleCompletedTrips.json request in Chrome DevTools]

This request fetches the most recent 10 rides. This is why when you check your most recent rides on the dashboard or the Careem App, it only shows 10 at a time.

Let’s look at some useful parameters being thrown in there.

start and limit, where the former is the starting offset into the rides and the latter is how many rides to fetch. There are a few other request props, such as serviceAreaId and key; both turned out to be of no use and could be removed from the request.

Careem makes this request via POST, which means you shouldn't be able to view it by typing the URL into your browser like I did (screenshot below).

[Screenshot: the JSON response opened directly in the browser]

But it works anyway, so why not ¯\_(ツ)_/¯

Anywho, as you can see, there are 10 items in the data array: my 10 most recent rides. I don't know what's up with the other properties like results and success, or why they're null when the request is obviously successful. Possibly leftover development code.

So I went from requesting 10 rides to 100 rides' worth of data in a single request. My GET request looked like this: https://app.careem.com/getAllAccessibleCompletedTrips.json?start=0&limit=100

[Screenshot: the response for start=0&limit=100]

Oh, I should also mention, in case you're not aware: you have to be logged in to view this URL.
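
If you'd rather script it than paste URLs into the browser, the whole thing boils down to replaying your logged-in session. Here's a minimal sketch in PHP, assuming you copy your own session cookie out of DevTools (the Cookie header below is a placeholder):

```php
<?php
// Fetch the first 100 completed trips by replaying the browser session.
// The Cookie header is a placeholder -- copy your real one from DevTools
// after logging in to app.careem.com.
$url = 'https://app.careem.com/getAllAccessibleCompletedTrips.json?start=0&limit=100';

$ch = curl_init($url);
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => ['Cookie: PASTE_YOUR_SESSION_COOKIE_HERE'],
]);
$body = curl_exec($ch);
curl_close($ch);

// The response is shaped roughly like:
// { "data": [ ...rides... ], "results": null, "success": null }
$rides = json_decode($body, true)['data'];
echo count($rides) . " rides fetched\n";
```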

For some reason, this sometimes returned an error if you gave it a large range, e.g. 200 rides. It would respond with the following:

[Screenshot: the error response for a large limit]

Downloading my data

Since I only had 380 rides, I only needed to make 2 requests and save the responses. The data comes back as JSON, a proper data structure format, which makes the work so much easier for me. In both files, all my rides were in an array property called data, so I just had to write a simple script to merge the two into one.

[Screenshot: the merged JSON data structure, collapsed]

This is the collapsed data structure (not all the info; as you can see, most of the arrays are collapsed). Pretty convenient.
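
The merge script itself is tiny. Roughly this (file names are illustrative; use whatever you saved the two responses as):

```php
<?php
// Merge the "data" arrays from the two dumped responses into one file.
$first  = json_decode(file_get_contents('rides-part1.json'), true);
$second = json_decode(file_get_contents('rides-part2.json'), true);

$merged = array_merge($first['data'], $second['data']);

file_put_contents(
    'rides-all.json',
    json_encode(['data' => $merged], JSON_PRETTY_PRINT)
);
echo count($merged) . " rides merged\n";
```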


What can I do with this data?

Now I have this huge dataset telling me anything and everything about each ride. This includes trip pricing breakdowns, dropoff and pickup metadata and coordinates, total price, distance covered, waiting time, en-route time, when the driver arrived, how long the driver had to wait, when we reached our destination, whether I (the client) or the driver is verified, whether my ride was waived, client data, driver data, car data, car type, and A WHOLE LOT more.

Introducing Careem Analyzer

composer require irfan/careem-rides-analysis

I spent the next few hours developing a small PHP parser to read the important parts of the available data; it's open source and available here. Visit that link to read more about what kind of data you can access for each ride.

And now I was finally able to produce a PoC: I pulled in the library and looped through my rides to sum up how much I had used Careem.
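
In spirit, the loop was no more than this. A sketch only: the field names ('tripPrice', 'distanceTravelled') are assumptions for illustration; the parser library exposes the real ones.

```php
<?php
// Sum total spend and distance across all rides from the merged dump.
// Field names here are assumptions; check the library for the real keys.
$rides = json_decode(file_get_contents('rides-all.json'), true)['data'];

$totalSpent = 0.0;
$totalKm    = 0.0;
foreach ($rides as $ride) {
    $totalSpent += (float) ($ride['tripPrice'] ?? 0);
    $totalKm    += (float) ($ride['distanceTravelled'] ?? 0);
}

printf("Rs. %s and %.2f Km\n", number_format($totalSpent), $totalKm);
```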

Rs. 40,731 and 2502.82 Km.

Wow, that's a lot… less. Compared to other direct modes of transport such as rickshaws, that is. An estimated 45% in savings over taking a rickshaw for my college transport. Possibly a lot more; this is just a very basic calculation.

So let's go deeper. I've made an example.php file that uses the parser library to analyze the data and even create a CSV with every driver's info (sketched below).
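
The CSV part is nothing fancy, just fputcsv over a few driver/car fields. A hedged sketch (the field names are again assumptions, not the dump's actual keys):

```php
<?php
// Dump basic driver/car info from the merged rides file to a CSV.
// Field names are assumptions for illustration.
$rides = json_decode(file_get_contents('rides-all.json'), true)['data'];

$csv = fopen('drivers.csv', 'w');
fputcsv($csv, ['driver', 'car_make', 'car_model', 'car_color', 'car_year']);
foreach ($rides as $ride) {
    $car = $ride['car'] ?? [];
    fputcsv($csv, [
        $ride['driverName'] ?? '',
        $car['make']  ?? '',
        $car['model'] ?? '',
        $car['color'] ?? '',
        $car['year']  ?? '',
    ]);
}
fclose($csv);
```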

Here’s the output for my rides:

Total Rides: 380
Total Spent: Rs. 40,731
Total Distance: 2502.8235 Km
Average price per km: 16.274 Per Km
Traveled in: GO, Bike, Go Mini, GO+ car types

Waived Rides: 17
Avg. In Journey Wait Time: 4.98 min
Avg. Initial Wait Time: 1.97 min
Total In Journey Wait Time: 1886 min
Total Initial Wait Time: 746 min

—BREAKDOWN—
Car Type: “GO”
Rides: 217 ride(s)
Total Spent: Rs. 25742.9
Avg Price/Ride: Rs. 118.63 /Ride
Avg Price/Km: Rs. 17.67 /Km
Avg. Distance/Ride: 6.71 Km/Ride
Avg. Duration/Ride: 15.33 Min/Ride
Total Distance: 1457 Km
Total Duration: 55.45 Hours

First Ride: Sunday Feb 5, 2017 at 9.38am

Car Type: “Bike”
Rides: 64 ride(s)
Total Spent: Rs. 2175.73
Avg. Price/Ride: Rs. 34 /Ride
Avg. Price/Km: Rs. 5.35 /Km
Avg. Distance/Ride: 6.35 Km/Ride
Avg. Duration/Ride: 14.98 Min/Ride
Total Distance: 406 Km
Total Duration: 15.98 Hours

First Ride: Monday Mar 19, 2018 at 8.34am

Car Type: “Go Mini”
Rides: 1 ride(s)
Total Spent: Rs. 87.93
Avg. Price/Ride: Rs. 87.93 /Ride
Avg. Price/Km: Rs. 27.36 /Km
Avg. Distance/Ride: 3.21 Km/Ride
Avg. Duration/Ride: 13.77 Min/Ride
Total Distance: 3 Km
Total Duration: 0.23 Hours

First Ride: Sunday Aug 12, 2018 at 8.13pm

Car Type: “GO+”
Rides: 98 ride(s)
Total Spent: Rs. 12726.96
Avg. Price/Ride: Rs. 129.87 /Ride
Avg. Price/Km: Rs. 20 /Km
Avg. Distance/Ride: 6.49 Km/Ride
Avg. Duration/Ride: 15.06 Min/Ride
Total Distance: 636 Km
Total Duration: 24.59 Hours

First Ride: Friday Mar 24, 2017 at 8.34am


And here are some cool graphs from the CSV dump of driver info.

[Chart: colors of the cars I rode in]

[Chart: makes of the cars]

[Chart: models of the cars]

[Chart: car build years]

Do note, the data is messy and I'm not bothering to tidy it up, because this is a PoC only and I'm already bored of this project.

Is this a security risk for Careem?

With what happened earlier this year in mind, I doubt this is a security risk, as it's not a hack. I don't have any other person's ride data available, only mine and mine alone.

But I did have a slight concern about the data available for each driver. Although it's displayed on the frontend as well, in this bulk amount it could be very useful for marketing or the like.

Nevertheless, I feel that Careem possibly puts too much driver information in the hands of the client. It's understandable why a client might need it, say, after losing something in the car or wanting to report the driver, but the data ranges all the way back to your initial ride. And that's something to think about.

¯\_(ツ)_/¯


That’s all for this post and project.


Making My Own Anime Database (part 1)

WHY?

Simply put, I wanted to build a recursive web scraper/crawler, and an up-to-date anime database parsed into JSON was lacking on GitHub. So I'm making one! What exactly are the steps to make your own anime database?

First off, you can't be doing manual data entry. You need a web crawler. And I'm targeting MyAnimeList. Not in any bad sense, love the site. o.o

MAL has its own API, but it's terrible. You cannot retrieve anime info without third-party APIs and wrappers. I've made Stats::Extract, which extracts data from an HTML file, so this shouldn't be too advanced for me.

THE STEPS

  1. Make the Crawler (done!)
  2. Make the Wrapper (working on it!)
  3. Make the Scraper (not even a single line)

In this post, I'll focus on

MAKING THE CRAWLER

The crawler is a script that requires an entry point, a link if you will, to a web page, and from there it searches for whatever you're looking for. In my case, I'm targeting anime (will do the manga too).

The entry point is: https://myanimelist.net

What I’m looking for: https://myanimelist.net/anime/{anime id}

So, after crawling the entry point, it looks for anime page links and adds them to the "queue pool". But it doesn't end there. It does its job as a crawler and iterates through the queue pool, loading each and every page and adding more links extracted from those pages to the pool! Now, this is a long process.
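
The core loop looks roughly like this. A hedged sketch of the idea, not the actual crawler source:

```php
<?php
// Sketch of the crawler's queue-pool loop: start at the entry point,
// collect /anime/{id} links, and keep expanding the pool.
$queue = ['https://myanimelist.net'];
$seen  = [];
$anime = [];

while ($queue && count($anime) < 2000) {   // stop somewhere sane
    $url = array_shift($queue);
    if (isset($seen[$url])) continue;
    $seen[$url] = true;

    $html = @file_get_contents($url);
    if ($html === false) continue;

    // Pull every same-site link off the page.
    preg_match_all('~href="(https://myanimelist\.net[^"]*)"~', $html, $m);
    foreach ($m[1] as $link) {
        // Anime pages look like /anime/{id}/...
        if (preg_match('~/anime/(\d+)~', $link, $id)) {
            $anime[$id[1]] = $link;        // dedupe by anime id
        }
        if (!isset($seen[$link])) {
            $queue[] = $link;              // grow the queue pool
        }
    }
}

file_put_contents('anime-links.json', json_encode($anime, JSON_PRETTY_PRINT));
```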

If you understood what I'm having it do, you might ask why in the world I don't extract the anime info using a wrapper since I'm already on its web page?! Well, you see, by the time I was done, I realized the process was very slow. I've started researching multithreading/forking in PHP so I can utilize that in the Scraper instead. Furthermore, I only had the crawler go through ~2,000 anime listings until I got tired of it. It proved my point: it was working. I could still use it for anime that get newly added to the database or something.
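
For reference, forking in PHP looks roughly like this. A minimal pcntl_fork sketch (requires the pcntl extension on a CLI build), not the actual scraper:

```php
<?php
// Split the crawled links across a few worker processes.
$links  = array_values(json_decode(file_get_contents('anime-links.json'), true));
$chunks = array_chunk($links, (int) ceil(count($links) / 4)); // 4 workers

$pids = [];
foreach ($chunks as $chunk) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed\n");
    } elseif ($pid === 0) {
        // Child: scrape this chunk, then exit.
        foreach ($chunk as $url) {
            // ...fetch and parse $url here...
        }
        exit(0);
    }
    $pids[] = $pid; // parent keeps track of its children
}

foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status); // wait for all workers to finish
}
```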

I got the rest of the anime from MAL users who had the most watched-anime entries.

The crawler is completely CLI (command line interface). The Wrapper will be a PHP Library and the Scraper will be CLI too.

I'll release the source code on GitHub when it's in a presentable state (soon).

THE PLAN

  1. Make a basic wrapper which fetches anime information (such as name, episodes, studios, producers, ratings, date aired, genre, etc). This would be a simple wrapper for the database which doesn’t need all the information stored on MAL anime pages.
  2. Make a scraper with multithreading/forking to use the anime database of their MAL links I have right now to fetch their data and make my database.
  3. Re-write the wrapper as a complete NON-AUTHENTICATION API to fetch each and everything about anime, manga, people, character, etc. Basically a complete wrapper for the whole site. And release it on github because MAL’s own API is lackluster.
  4. Re-write the scraper with the crawler and the wrapper as its main components. So this time, asynchronously, the scraper will add anime links to the pool and extract the anime information on those pages directly. This could probably be the ultimate MAL Scraper.

That is, if I get it done.

Oh, and a sneak peek at the wrapper.

[Screenshot: a preview of the wrapper code]

Part 2: https://irfandahir.wordpress.com/2017/05/13/making-my-own-anime-database-part-2/