A lot of people request data collection or mining jobs for Amazon, either using the official AWS or MWS APIs, scraping the site for information the APIs do not provide, or a combination of both. One of the biggest requests I get is for seller product scraping: crawling a seller's product catalog and acquiring the product data in order to put together a product database of their own for e-commerce sites, or sometimes just to get product images. While this can be accomplished via scraping, there are some 3rd party APIs that make it much easier. Depending on how much you are scraping, those 3rd party resources, which do cost some money, might be faster and more effective than scraping the data yourself.
The second most requested Amazon task I get is inventory level scraping for specific ASIN / Seller ID combinations. Whether it is for MAP pricing enforcement or to keep products competitively priced while other sellers still have inventory in stock, there is no API that I am aware of that will give you another seller's inventory levels from Amazon.
I recently created such an API for private use by some of my clients, and I wanted to explain the infrastructure of services required to accomplish such a task.
First off, scraping requires servers: the more you want to scrape, the more servers you will need. I prefer to use AWS EC2 instances for these tasks. You can even use spot instances along with Amazon's SQS to load-balance the scraping across as many instances as you want, which lowers the cost considerably (I include a minimal SQS sketch at the end of this post). Whether you are scraping 50+ million records a day from Amazon or 1,000, you will always need the following…
1) PROXIES PROXIES PROXIES
You will always need to use proxies. Reliable premium proxies are best, especially private proxies, although shared proxies will work. Expect to spend anywhere from $0.75 to $2.50 per proxy per month. Know that every month you will have to rotate to a fresh set of proxies, because eventually proxies get banned. If you are using dedicated proxies with IP blocks registered to known hosting providers, you will not get much bang for your buck, because those proxies are more likely to trigger Captchas and 503 HTTP responses as well as the “Sorry, something went wrong!” page. Amazon is incredibly good at keeping you from scraping their site, and in my opinion they could make it nearly impossible, but they do give you some wiggle room, which can make scraping cost effective. The best proxies to use are residential proxies, but those are quite expensive. I have clients who use as few as 10 proxies a month and others that use over 16,000 proxies a month. There are of course services that sell ports or concurrent connections to a proxy IP and port that rotates automatically every few minutes, but unless you are paying for a high-end service like Luminati, those can be problematic for several reasons. For one, the IPs can rotate at any moment, which can mess up session data and cause a process to restart, and if you have to stay logged in for whatever reason, they don't work well either. It depends on the situation, but for inventory scraping it is best, in my opinion, to use a flat list of proxies that you can refresh monthly (a minimal sketch of rotating such a list follows).
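For that flat-list approach, here is a minimal sketch in Python using the requests library. The proxy addresses, retry count, and blocked-page checks are placeholder assumptions, not recommendations.

```python
import itertools
import requests

# Example addresses only; swap in the monthly proxy list you actually purchased.
PROXIES = [
    "http://user:pass@203.0.113.10:8080",
    "http://user:pass@203.0.113.11:8080",
]
proxy_pool = itertools.cycle(PROXIES)

def fetch(url, retries=3):
    """Try the request through successive proxies, rotating on failures or blocks."""
    for _ in range(retries):
        proxy = next(proxy_pool)
        try:
            resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)
        except requests.RequestException:
            continue  # dead or timed-out proxy, move to the next one
        if resp.status_code == 503 or "Sorry" in resp.text:
            continue  # blocked or error page, rotate to the next proxy
        return resp
    return None
```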
2) Captcha breaking service
Your proxies will last MUCH longer if you use a Captcha breaking service. You can use open source solutions based on Google's Tesseract OCR, but honestly, for as little as $1 per 1,000 captchas, it is best to go with the ultra-quick, 95%+ accurate solutions that cost money. It takes a lot of CPU power to break thousands of captchas, so paying for a 3rd party service is likely cheaper in the long run unless you are doing several million pages a day. Expect that within a few days, at best, every new session per proxy will require captcha breaking.
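As a rough illustration of wiring a paid solver into the scraper, the snippet below posts the captcha image Amazon served to a generic solving endpoint. The URL, API key parameter, and response shape are hypothetical; adapt them to whichever provider you choose.

```python
import requests

SOLVER_URL = "https://api.example-captcha-solver.com/solve"  # hypothetical endpoint
API_KEY = "your-api-key"

def solve_captcha(captcha_image_url):
    """Download the captcha image, post it to the solving service, return the solved text."""
    image_bytes = requests.get(captcha_image_url, timeout=15).content
    resp = requests.post(
        SOLVER_URL,
        data={"key": API_KEY},                       # assumed auth parameter
        files={"image": ("captcha.jpg", image_bytes)},
        timeout=60,
    )
    return resp.json().get("text")                   # assumed response shape
```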
3) On-going maintenance
It is simply the nature of scraping: you cannot build a solution that will keep working as intended indefinitely, because Amazon will eventually make some tiny change that breaks your scraper. Unlike APIs, which stay stable for long periods, scraping requires consistent monitoring and maintenance to keep the solution performing as desired.
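One cheap way to catch those breakages early is a scheduled health check that parses a handful of known-good pages and alerts when expected fields go missing. The sketch below assumes a parse_product function and an alert hook that you would supply.

```python
REQUIRED_FIELDS = {"title", "price", "seller"}

def parse_product(html):
    """Placeholder for your real parser; should return a dict of extracted fields."""
    ...

def health_check(sample_pages, alert):
    """sample_pages: list of (asin, html) pairs fetched from known-good products."""
    for asin, html in sample_pages:
        fields = parse_product(html) or {}
        missing = REQUIRED_FIELDS - set(fields)
        if missing:
            alert(f"Scraper likely broken for {asin}: missing {sorted(missing)}")
```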
4) Don’t scrape more than you need to
A lot of people want to scrape things as fast as they can, as often as they can, just because they can. If there isn't a specific need for a particular piece of data, keep it simple and only scrape as much, and as often, as you actually need to (see the sketch below).
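One simple way to enforce that, assuming you track when each ASIN was last pulled, is to skip anything refreshed within your chosen window:

```python
import time

REFRESH_SECONDS = 6 * 3600  # assumed refresh window; tune it to the actual business need

def select_stale(asins, last_scraped):
    """last_scraped maps asin -> unix timestamp of the previous scrape."""
    now = time.time()
    return [a for a in asins if now - last_scraped.get(a, 0) > REFRESH_SECONDS]
```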
5) Servers
You need servers to handle the requests, the data, the accompanying tasks, etc. I am not going to get into much detail here, as every situation is different, but definitely factor in the cost of servers.
6) Initial development
Finally, you need a developer to string all of this together. Most new clients asking for scraping services are really only thinking about the cost of getting the data mining operation built. That is just one of the expenses; in my experience, the bulk of the cost comes from the items mentioned above.
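As promised above, here is a minimal sketch of spreading scrape jobs across EC2 spot instances with Amazon's SQS via boto3. The queue name, message format, and scrape_product stub are assumptions for illustration only.

```python
import json
import boto3

sqs = boto3.resource("sqs")
queue = sqs.get_queue_by_name(QueueName="scrape-jobs")  # hypothetical queue name

def scrape_product(asin):
    """Placeholder for the actual scraping routine."""
    ...

def enqueue_jobs(asins):
    """Push one message per ASIN; any number of worker instances can consume them."""
    for asin in asins:
        queue.send_message(MessageBody=json.dumps({"asin": asin}))

def worker_loop():
    """Runs on each (spot) instance: pull jobs, scrape, delete the message on success."""
    while True:
        for message in queue.receive_messages(WaitTimeSeconds=20, MaxNumberOfMessages=10):
            job = json.loads(message.body)
            scrape_product(job["asin"])
            message.delete()  # only delete once the job has actually succeeded
```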