A lot of people request data collection or mining jobs for Amazon, either using the official AWS or MWS APIs, scraping the site for information the APIs do not provide, or a combination of both. One of the biggest requests I get is for seller product scraping: crawling a seller's product catalog and acquiring the product data in order to put together a product database of their own for e-commerce sites, or sometimes just to get product images. While this can be accomplished via scraping, there are some 3rd party APIs that make it much easier. Depending on how much you are scraping, those 3rd party resources, which do cost some money, might be faster and more effective than scraping the data yourself.
The second most requested Amazon task I get is inventory level scraping for specific ASIN / Seller ID combinations. Whether it is for MAP pricing enforcement or to keep products competitively priced while other sellers still have inventory in stock, there is no API that I am aware of that will give you another seller's inventory levels from Amazon.
I recently created such an API for private use by some of my clients, and I wanted to explain the infrastructure of services required to accomplish such a task.
First off, scraping requires servers: the more you want to scrape, the more servers you will need. I prefer to use AWS EC2 instances for these tasks. You can even use spot instances along with Amazon's SQS to load-balance the scraping across as many instances as you want, which lowers the cost considerably (I include a minimal SQS sketch at the end of this post). Whether you are scraping 50+ million records a day from Amazon or 1,000, you will always need the following…
1) PROXIES PROXIES PROXIES
You will always need to use proxies. Reliable premium proxies are best, especially private proxies, although shared proxies will work. Expect to spend anywhere from $0.75 to $2.50 per proxy per month. Know that every month you will have to rotate to a fresh set of proxies, because eventually proxies get banned. If you are using dedicated proxies with IP blocks registered to known hosting providers, you will not get much bang for your buck, because those proxies are more likely to trigger Captchas and 503 HTTP responses as well as the “Sorry, something went wrong!” page. Amazon is incredibly good at keeping you from scraping their site, and in my opinion they could make it nearly impossible, but they do give you some wiggle room, which can make scraping cost effective. The best proxies to use are residential proxies, but those are quite expensive. I have clients who use as few as 10 proxies a month and others that use over 16,000 proxies a month. There are of course services that sell ports or concurrent connections to a proxy IP and port that rotates automatically every few minutes, but unless you are paying for a high-end service like Luminati, those can be problematic for several reasons. For one, the IPs can rotate at any moment, which can mess up session data and cause a process to restart, and if you have to stay logged in for whatever reason, they don't work well either. It depends on the situation, but for inventory scraping it is best, in my opinion, to use a flat list of proxies that you can refresh monthly (a minimal sketch of rotating such a list follows).
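For that flat-list approach, here is a minimal sketch in Python using the requests library. The proxy addresses, retry count, and blocked-page checks are placeholder assumptions, not recommendations.

```python
import itertools
import requests

# Example addresses only; swap in the monthly proxy list you actually purchased.
PROXIES = [
    "http://user:pass@203.0.113.10:8080",
    "http://user:pass@203.0.113.11:8080",
]
proxy_pool = itertools.cycle(PROXIES)

def fetch(url, retries=3):
    """Try the request through successive proxies, rotating on failures or blocks."""
    for _ in range(retries):
        proxy = next(proxy_pool)
        try:
            resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)
        except requests.RequestException:
            continue  # dead or timed-out proxy, move to the next one
        if resp.status_code == 503 or "Sorry" in resp.text:
            continue  # blocked or error page, rotate to the next proxy
        return resp
    return None
```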
2) Captcha breaking service
Your proxies will last MUCH longer if you use a Captcha breaking service. You can use open source solutions based on Google's Tesseract OCR, but honestly, for as little as $1 per 1,000 captchas, it is best to go with the ultra-quick, 95%+ accurate solutions that cost money. It takes a lot of CPU power to break thousands of captchas, so paying for a 3rd party service is likely cheaper in the long run unless you are doing several million pages a day. Expect that within a few days, at best, every new session per proxy will require captcha breaking.
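As a rough illustration of wiring a paid solver into the scraper, the snippet below posts the captcha image Amazon served to a generic solving endpoint. The URL, API key parameter, and response shape are hypothetical; adapt them to whichever provider you choose.

```python
import requests

SOLVER_URL = "https://api.example-captcha-solver.com/solve"  # hypothetical endpoint
API_KEY = "your-api-key"

def solve_captcha(captcha_image_url):
    """Download the captcha image, post it to the solving service, return the solved text."""
    image_bytes = requests.get(captcha_image_url, timeout=15).content
    resp = requests.post(
        SOLVER_URL,
        data={"key": API_KEY},                       # assumed auth parameter
        files={"image": ("captcha.jpg", image_bytes)},
        timeout=60,
    )
    return resp.json().get("text")                   # assumed response shape
```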
3) On-going maintenance
It is simply the nature of scraping: you cannot build a solution that will keep working as intended indefinitely, because Amazon will eventually make some tiny change that breaks your scraper. Unlike APIs, which stay stable for long periods, scraping requires consistent monitoring and maintenance to keep the solution performing as desired.
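One cheap way to catch those breakages early is a scheduled health check that parses a handful of known-good pages and alerts when expected fields go missing. The sketch below assumes a parse_product function and an alert hook that you would supply.

```python
REQUIRED_FIELDS = {"title", "price", "seller"}

def parse_product(html):
    """Placeholder for your real parser; should return a dict of extracted fields."""
    ...

def health_check(sample_pages, alert):
    """sample_pages: list of (asin, html) pairs fetched from known-good products."""
    for asin, html in sample_pages:
        fields = parse_product(html) or {}
        missing = REQUIRED_FIELDS - set(fields)
        if missing:
            alert(f"Scraper likely broken for {asin}: missing {sorted(missing)}")
```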
4) Don’t scrape more than you need to
A lot of people want to scrape things as fast as they can, as often as they can, just because they can. If there isn't a specific need for a particular piece of data, keep it simple and only scrape as much, and as often, as you actually need to (see the sketch below).
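One simple way to enforce that, assuming you track when each ASIN was last pulled, is to skip anything refreshed within your chosen window:

```python
import time

REFRESH_SECONDS = 6 * 3600  # assumed refresh window; tune it to the actual business need

def select_stale(asins, last_scraped):
    """last_scraped maps asin -> unix timestamp of the previous scrape."""
    now = time.time()
    return [a for a in asins if now - last_scraped.get(a, 0) > REFRESH_SECONDS]
```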
5) Servers
You need servers to handle the requests, the data, the accompanying tasks, etc. I am not going to get into much detail here, as every situation is different, but definitely factor in the cost of servers.
6) Initial development
Finally, you need a developer to string all of this together. Most new clients asking for scraping services are really only thinking about the cost of getting the data mining operation built. That is just one of the expenses; in my experience, the bulk of the cost comes from the items mentioned above.
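As promised above, here is a minimal sketch of spreading scrape jobs across EC2 spot instances with Amazon's SQS via boto3. The queue name, message format, and scrape_product stub are assumptions for illustration only.

```python
import json
import boto3

sqs = boto3.resource("sqs")
queue = sqs.get_queue_by_name(QueueName="scrape-jobs")  # hypothetical queue name

def scrape_product(asin):
    """Placeholder for the actual scraping routine."""
    ...

def enqueue_jobs(asins):
    """Push one message per ASIN; any number of worker instances can consume them."""
    for asin in asins:
        queue.send_message(MessageBody=json.dumps({"asin": asin}))

def worker_loop():
    """Runs on each (spot) instance: pull jobs, scrape, delete the message on success."""
    while True:
        for message in queue.receive_messages(WaitTimeSeconds=20, MaxNumberOfMessages=10):
            job = json.loads(message.body)
            scrape_product(job["asin"])
            message.delete()  # only delete once the job has actually succeeded
```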