For many years, manual data entry in Excel (sourcing from books, as seen in this video) or manual copy-pasting from websites was my only way of creating databases. It was a slow process that limited the size of the databases I could make. Even so, I made about 40 databases about cars, geography, real estate, gaming, etc., purely as a hobby.
Since 2015 I have offered web scraping services in ANY field, not limited to automobiles. Scraping usually means running software that visits a list of given pages, copies specific data from each page and puts it in a database automatically. If you need something very different from my current databases, I can create new databases as long as you provide a source of data: a website to extract data from. Do not expect me to get data that you cannot find yourself in any form, for example:
– Do not think that I can compile a table with dimensions of car lights, bumpers, windows, etc. Such dimensions are not provided in car manuals. If you sell such car parts, measuring your own parts yourself is the only solution.
– Do not think that I can compile a table with the number of cars sold in your country, broken down by model, if your government is not tracking sales and making them public on the internet. Data needs to be available somewhere in order to scrape it and put it into a database.
– Theoretically I can scrape data from any website, but only websites that present the particular data you are interested in with a consistent structure from page to page can produce a good, usable database. After automatic scraping, more or less manual work is needed to make the database usable.
Simple data scraping service
This applies to websites where each item has its own URL and data is not hidden in drop-down boxes or JavaScript code.
There are a few tools available online, usually free to download, but they are limited in functionality, in the number of pages you can extract and in pages per second, unless you upgrade to a paid subscription, which is ridiculously expensive for the number of pages you can extract. Although you can scrape a small number of pages yourself for free, it may take a few days to learn to use these tools efficiently. Most people do not have the time to learn or the money for a monthly subscription. I can help you: my partner wrote scraping software in Visual Basic that is comparable with the tools available online but has no limit on the number of pages. This allows me to scrape websites at a lower price than you can do it yourself.
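To show what simple scraping involves, here is a minimal sketch in Python. It is not my partner's Visual Basic software, only an illustration. It assumes each item has its own URL and that every piece of data sits as plain text inside an HTML element with a known id. The names FieldParser, fetch, scrape and to_csv, and the ids used as column names, are made up for this example.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


class FieldParser(HTMLParser):
    """Collect the text of every element whose id is in `wanted_ids`.

    Assumption: the value is plain text directly inside the element.
    """

    def __init__(self, wanted_ids):
        super().__init__()
        self.wanted_ids = set(wanted_ids)
        self.current = None  # id of the element being read, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id in self.wanted_ids:
            self.current = element_id
            self.fields[element_id] = ""

    def handle_data(self, data):
        if self.current is not None:
            self.fields[self.current] += data

    def handle_endtag(self, tag):
        self.current = None


def fetch(url):
    """Download one page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(urls, columns, fetch_page=fetch):
    """Visit each URL and return one row (a dict) per page."""
    rows = []
    for url in urls:
        parser = FieldParser(columns)
        parser.feed(fetch_page(url))
        row = {"url": url}
        for column in columns:
            row[column] = parser.fields.get(column, "").strip()
        rows.append(row)
    return rows


def to_csv(rows, columns):
    """Turn the scraped rows into CSV text, one line per page."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["url"] + list(columns), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Real websites rarely mark their data this neatly, which is why manual work is usually needed after scraping. The fetch_page parameter lets you try the sketch on saved pages before touching a live website.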
My price is the sum of the following 4 things:
Number of pages to be scraped: up to 1,000 pages = $20, up to 10,000 pages = $50, up to 100,000 pages = $200.
Number of columns (pieces of data to scrape from each page): 20 cents for each column.
Complexity: no fee for websites where all items are accessible from an index page; an extra fee if items are displayed with infinite scrolling or pagination, or require entering data in search boxes, etc.
Work after scraping: certain websites do not provide data in the format you need, so I charge an extra fee to arrange the data in Excel.
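The sum of these 4 things can be sketched as a small Python calculation. The page tiers and the 20 cents per column come from the list above. The complexity fee and the fee for work after scraping have no fixed value, so here they are plain inputs that default to zero. The function names are made up for this example, and projects over 100,000 pages have no listed price, so the sketch refuses them instead of guessing.

```python
# Page tiers from the price list: (maximum pages, fee in USD).
PAGE_TIERS = [(1_000, 20.0), (10_000, 50.0), (100_000, 200.0)]
PRICE_PER_COLUMN = 0.20  # 20 cents for each column


def page_fee(pages):
    """Return the fee of the first tier that covers `pages`."""
    for limit, fee in PAGE_TIERS:
        if pages <= limit:
            return fee
    raise ValueError("over 100,000 pages: no listed tier, ask for a quote")


def simple_scraping_price(pages, columns, complexity_fee=0.0, after_work_fee=0.0):
    """Sum the 4 price components (the last two are quoted per project)."""
    total = (
        page_fee(pages)
        + columns * PRICE_PER_COLUMN
        + complexity_fee
        + after_work_fee
    )
    return round(total, 2)


# Example: 2,500 pages with 15 columns, all items linked from an index page.
example_price = simple_scraping_price(pages=2_500, columns=15)
```

In this example the page fee is $50 (the "up to 10,000 pages" tier) and the columns add $3, so example_price is 53.0 before any complexity or Excel work.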
Complex data scraping service
This applies to websites that have drop-down lists, search boxes or JavaScript code and require the user to perform some actions to get the page containing the data you want to scrape. In this case online scraping tools are useless, so my partner will write a custom data scraper in PHP or Visual Basic. This may take a few days, depending on his available time.
Price: usually within the $200 to $500 range, which I share with my partner. The price depends more on the complexity of the website than on the number of pages to be scraped.
For fewer than 200 records it may be faster to copy-paste manually than to code a data scraper.
Impossible data scraping
Car classifieds websites usually hide the seller's phone number and contact email, which are revealed only by clicking a button. This is done specifically to prevent scraping and to protect emails from being spammed. The only solution is to have a human visit each page and copy-paste this hidden data, which requires a large amount of time. If you are an insurance company wanting to do SMS or email marketing and you intend to hire me to make a database of car owners sourced from a classifieds website, most likely I can’t help you because of the time required.
Anti-scraping
Some websites look simple to scrape, but they turn out to be complex because of IP blocking, CAPTCHA or other measures intended either to prevent someone from copying their data or just to prevent DDoS attacks. If you ask for a price before I start the job, you should be prepared for price changes. I need to do part of the job to be able to tell you the final price.
Notes
The main advantage of working with me is that once I create a database, multiple people can purchase it, so you pay just a small part of the cost of scraping (if the database is something of personal interest to me, such as cars or real estate). If you want to keep it private, I can sell it just to you at a higher price and not publish it on the website. But the BIG question is what I should do if a second customer asks me to scrape the same website and agrees to have it published on the website to get a cheaper price. I reserve the right to sell to other people if they ask. If you ask me to scrape a website that is outside my personal interests and unrelated to the fields covered by my website, I will not publish it because it is unlikely that anyone else will purchase it, and you need to pay the full cost of scraping.
The databases published on the website include FREE updates for one year, with a higher update frequency for products with a higher sales volume. But if you ask me to scrape a website privately, “just for you”, you need to pay for each update, with the price depending on how much time each re-scraping takes.
The scraping software runs at a speed of 0.5-2 pages per second, depending on the website. So I may not be able to do very large databases, for example if you want to scrape 1 million records with monthly updates. The limit on how many records I can scrape depends on the number of customers in the current month.
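To see why 1 million records is a problem at this speed, here is a rough estimate in Python. It assumes one page per record and a scraper that runs nonstop without being blocked, which is optimistic. The function name scraping_hours is made up for this example.

```python
def scraping_hours(records, pages_per_second):
    """Hours needed to fetch `records` pages at a steady speed."""
    return records / pages_per_second / 3600


# 1 million records at the two ends of the 0.5-2 pages per second range.
fastest = scraping_hours(1_000_000, 2.0)  # about 139 hours, almost 6 days
slowest = scraping_hours(1_000_000, 0.5)  # about 556 hours, over 23 days

print(f"fastest: {fastest:.0f} hours, slowest: {slowest:.0f} hours")
```

At the slow end, a single run would occupy the scraper for most of a month, leaving little room for a monthly update or for other customers.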
Is data scraping legal or not?
Usually scraping is legal, but using scraped data on a public website may be illegal.
It depends. If the data is added by volunteers, or by sellers on classifieds websites, scraping is probably legal. But if the authors of a website worked hard to compile data from sources like car brochures or manufacturer websites, scraping is probably illegal, especially if you use their data to make your own website or for another commercial purpose. Most websites contain dummy data (example: a bunch of cars having +/- 1 horsepower compared with the official value), and if you use data copied from them, they can prove where you stole the data from and file a lawsuit against you. BEWARE!
For a moment I became concerned that my European car database, sourced from AutoKatalog books, might be a copyright violation, but I came to the conclusion that it is fine, as long as mine is an original compilation with a different data structure than the book and targets an online audience, while the AutoKatalog is a book sold in shops and targeting car hobbyists. I make over 100 sales each year without a single person worrying about copyright.
In the case of America, Year-Make-Model is my original compilation sourced from Wikipedia and 3 more websites, while Year-Make-Model-Trim-Specs is scraped from the Edmunds.com website, which also offers an API and thus allows other websites to use its data, so again it is legal.
But since I created the Indian car database in 2015, sourcing data from carwale.com, I have been concerned that what I am doing may be illegal.
Country matters: I have had many customers in India asking me to scrape data from various websites. However, when someone from Europe or America asks me for certain data that I don’t have and I propose scraping it from a website, most of them raise the legal issues of web scraping.
Funny case: someone offered to sell me a car database that he claimed to have created by working for 4 months, 8 hours per day, copy-pasting data from a website. From a copyright point of view it does not matter whether you scraped automatically or typed every letter manually: as long as you copied from a website, your work is not original. He was probably not aware of scraping software. If you wasted a few months doing something that could have been done in a few hours, you’re an idiot (I was an idiot too, doing such jobs before 2015 when I was not aware of scraping software, but small jobs only). I am still doing manual entry for the European database because I source data from books (offline sources), making an original product on the web.
Examples of completed scraping projects and their prices
All scraping software saves data in CSV format, but when it comes to publishing on the website, I save it as XLS and add borders, colors, headers and other visual enhancements to match the style of other products “Made by Teoalida”.
India Car Database – source: www.carwale.com – Made out of personal interest because numerous people asked me about an Indian car database. As it was my first scraping project, it initially took about 7 days to figure out how to do it, and I later figured out that I can do it in 2 days. Price: 30-120 euro.
India Bike Database – source: www.bikewale.com – Made after a 2nd person requested a database of bikes sold in India. One of the easiest projects, having no drop-down boxes but plain links to each bike page. 250 records, price: 25 euro.
CarWale On-Road Prices – source: www.carwale.com – Made for a customer. A difficult project that took about 20 hours of work in Visual Basic to make an application that sends JavaScript requests to the CarWale website to get the price of each car in each city. The application works at a rate of 2 requests per second, so 3100 cars × 510 cities = 1,581,000 requests, which is about 790,500 seconds or 220 hours needed to get all on-road prices, RTO tax and insurance. Price: $300 USD, of which $200 goes to the programmer and $100 is my fee for keeping the scraping application running daily for a month.
Skyscrapers Buildings Database – source: www.emporis.com – Made out of personal interest and put up for sale for $150 (15000 buildings). It turned into a marketing failure: 1 year passed and nobody purchased it (except a customer who asked me to make a US buildings database, see below). It took about 20 hours to manually compile a list of cities with buildings over 100 meters, then a list of buildings from these cities, and then I used software to automatically extract data about each building. 15000+ buildings. Emporis blocks my IP for 2 days if I access more than 3000 pages in one day, so data extraction was limited to 3000 buildings per day, which took about 1 hour daily for 6 days.
US Buildings Database – source: www.emporis.com – Made for a customer who, having seen the Skyscrapers database above, told me to make a similar database with all types of buildings in the USA. 160,000+ buildings. I had to run the scraper in over 100 batches of at most 2000 buildings each and change IP, re-running blocked URLs again and again until I was able to get all buildings. 60 hours of work. Price: $600.
Singapore Condo Database – source: www.singaporeexpats.com – Made for a customer. It took 3 hours, and I sold the database with 2809 condos for $140.50 SGD.
Singapore Condo Database II – source: www.propertyguru.com.sg – Made for a customer. Apparently an easy project, having plain links to all condos, it turned difficult because of an annoying CAPTCHA page appearing after every 10 pages extracted. My programmer partner spent 5 days on it in Visual Basic and charged me $300 USD, and in the end I sold the database with 3176 condos for $317.60 SGD (about 240 USD), leaving me at a loss unless I sell the same database to a second customer.
Sulekha.xls – source: www.sulekha.com – A somewhat unusual data scraping job: a one-time-use database for SMS and email marketing, instead of a saleable product containing all car models, all buildings, all of something.
I have done a few more databases, but the customers told me NOT to publish them on the website, or they are in fields unrelated to the topics covered by my website, so even if published, they wouldn’t get sales.
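The batch approach used for the US Buildings Database (running the scraper in batches and re-running blocked URLs until all are done) can be sketched in Python as below. This is only an illustration, not the actual software. BlockedError, batches, scrape_until_done and the fetch_page argument are names made up for this example, and how to change IP between batches is left as a comment because it depends on the website and the connection.

```python
class BlockedError(Exception):
    """Raised by fetch_page when the website refuses a request."""


def batches(urls, batch_size):
    """Split the URL list into consecutive batches."""
    for start in range(0, len(urls), batch_size):
        yield urls[start:start + batch_size]


def scrape_until_done(urls, fetch_page, batch_size=2000, max_rounds=10):
    """Fetch all URLs in batches and retry the blocked ones in later rounds.

    Returns the fetched pages and the URLs still blocked after max_rounds.
    """
    results = {}
    pending = list(urls)
    for _ in range(max_rounds):
        if not pending:
            break
        blocked = []
        for batch in batches(pending, batch_size):
            for url in batch:
                try:
                    results[url] = fetch_page(url)
                except BlockedError:
                    blocked.append(url)
            # Assumption: change IP or wait here before the next batch.
        pending = blocked
    return results, pending
```

The max_rounds limit stops the loop if some URLs stay blocked no matter how often they are retried, so those can be checked by hand.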