Custom Web & Automation / Bot Development - PHP Developer | Think Genius LLC

Posts tagged "data mining"


Amazon Inventory Scraping

Posted on November 19, 2019 by Dustin Holdiman in Development Blog

So a lot of people request data collection or mining jobs for Amazon, either utilizing the official AWS or MWS APIs, scraping the site for information that the APIs do not provide, or a combination of both. One of the biggest requests I get is for seller product scraping: scraping a seller's product catalog and then acquiring the product data in order to put together a product database of their own for e-commerce sites, or sometimes just to get product images. While this can be accomplished via scraping, there are some 3rd party APIs that make it much easier. Depending on how much you are scraping, those 3rd party resources, which do cost some money, might be faster and more effective than simply scraping the data yourself.
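
Just to give the rough shape of that option, here is a minimal sketch of pulling catalog data from a paid product-data API by ASIN. The endpoint, key parameter, and response fields are all made up for the example; every vendor's API looks a little different.

```php
<?php
// Hypothetical 3rd-party product-data API: one keyed HTTP request per ASIN.
// The endpoint and JSON fields below are illustrative only.
function fetchProduct(string $asin, string $apiKey): ?array
{
    $url = 'https://api.example-productdata.com/v1/products/' . urlencode($asin)
         . '?key=' . urlencode($apiKey);

    $json = file_get_contents($url);
    if ($json === false) {
        return null; // network error or unknown ASIN
    }

    // Typical payloads include title, description, image URLs, and pricing.
    return json_decode($json, true);
}

$product = fetchProduct('B000EXAMPLE', 'your-api-key');
```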

The second most requested Amazon task I get is inventory level scraping for specific ASIN / Seller ID combinations. Whether it is for MAP pricing enforcement or for keeping products competitively priced while other sellers still have inventory in stock, there is no API that I am aware of that will give you another seller's inventory levels from Amazon.

I recently created such an API for private use by some of my clients, and I wanted to explain the infrastructure of services required to accomplish such a task.

First off, scraping requires servers: the more you want to scrape, the more servers you will need. I prefer to use AWS EC2 instances for these tasks. You can even use spot instances along with Amazon's SQS to load balance the scraping across as many instances as you want, which considerably lowers the price. Whether you are scraping 50+ million records a day from Amazon or 1,000, you will always need the following…
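
To make that concrete, here is a minimal sketch of an SQS-driven worker loop using the AWS SDK for PHP. The queue URL, the message format (JSON with an ASIN and seller ID), and the scrapeInventory() routine are all assumptions for illustration.

```php
<?php
// Minimal SQS worker sketch: each EC2/spot instance runs a loop like this,
// pulling scrape jobs from a shared queue. The queue URL and message shape
// (JSON with "asin" and "seller_id") are hypothetical.
require 'vendor/autoload.php';

use Aws\Sqs\SqsClient;

$sqs = new SqsClient([
    'region'  => 'us-east-1',
    'version' => '2012-11-05',
]);

$queueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789012/scrape-jobs'; // hypothetical

while (true) {
    // Long polling keeps empty-queue requests (and cost) to a minimum.
    $result = $sqs->receiveMessage([
        'QueueUrl'            => $queueUrl,
        'MaxNumberOfMessages' => 10,
        'WaitTimeSeconds'     => 20,
    ]);

    foreach ($result['Messages'] ?? [] as $message) {
        $job = json_decode($message['Body'], true);

        // scrapeInventory() stands in for your actual scraping routine.
        scrapeInventory($job['asin'], $job['seller_id']);

        // Only delete the message once the job succeeds, so a crashed
        // worker leaves the job visible for another instance to pick up.
        $sqs->deleteMessage([
            'QueueUrl'      => $queueUrl,
            'ReceiptHandle' => $message['ReceiptHandle'],
        ]);
    }
}
```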

1) PROXIES PROXIES PROXIES

You will always need to use proxies. Reliable premium proxies are best, private proxies even more so, although shared proxies will work. Expect to spend anywhere from $.75 – $2.50 per proxy per month, and know that every month you will have to rotate to a fresh set, because eventually proxies get banned. If you are using dedicated proxies on IP blocks registered to known hosting providers, you will not get much bang for your buck: these proxies are more likely to pop up with CAPTCHAs, 503 HTTP responses, and the “Sorry, something went wrong!” page. Amazon is incredibly good at keeping you from scraping their site, and in my opinion they could make it nearly impossible, but they do give you some wiggle room, which can make scraping cost effective. The best proxies to use are residential proxies, but those are quite expensive. I have clients who use as few as 10 proxies a month and others that use over 16,000.

There are, of course, services that sell ports or concurrent connections to a proxy IP and port that automatically rotates every few minutes, but unless you are paying for high-end services like Luminati, those can be problematic for several reasons: the IPs can rotate at any moment, which can wipe out session data and force a process to restart, and if you have to stay logged in for whatever reason, they don't work too well either. It depends on the situation, but for inventory scraping it's best, in my opinion, to use a flat list of proxies that you can refresh monthly.
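
Here is a minimal sketch of that flat-list approach, assuming a plain proxies.txt with one ip:port:user:pass entry per line; the file format and rotation strategy are just illustrative.

```php
<?php
// Fetch a page through a proxy picked from a flat monthly list.
// proxies.txt format (hypothetical): one ip:port:user:pass entry per line.
$proxies = file('proxies.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

function fetchViaProxy(string $url, string $proxyLine)
{
    [$ip, $port, $user, $pass] = explode(':', $proxyLine);

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_PROXY          => "$ip:$port",
        CURLOPT_PROXYUSERPWD   => "$user:$pass",
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    ]);
    $html = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // A 503 usually means this proxy is burned for now; let the caller
    // cool it down or retire it and retry with another one.
    if ($html === false || $code === 503) {
        return false;
    }
    return $html;
}

// Simple rotation: spread requests randomly across the whole list.
$html = fetchViaProxy('https://www.amazon.com/dp/B000EXAMPLE', $proxies[array_rand($proxies)]);
```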

2) CAPTCHA breaking service

You will get a MUCH longer lifespan out of your proxies by using a CAPTCHA breaking service. You can use open source solutions based on Google's Tesseract OCR, but honestly, at as little as $1 per 1,000 CAPTCHAs, it's best to go for the ultra-quick, 95%+ accurate solutions that cost money. It takes a lot of CPU power to break thousands of CAPTCHAs, so paying for a 3rd party service is likely cheaper in the long run unless you are doing several million pages a day. Expect that within a few days, at best, every new session per proxy will require CAPTCHA breaking.
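
Most of the paid services follow the same submit-then-poll pattern; here is a hedged sketch of what that usually looks like. The endpoint, parameters, and responses are stand-ins, not any specific vendor's API.

```php
<?php
// Hypothetical CAPTCHA-solving flow: upload the image, then poll for the
// answer. Endpoint, parameters, and response format are illustrative only;
// consult your vendor's docs for the real API.
function solveCaptcha(string $imagePath, string $apiKey): ?string
{
    $ch = curl_init('https://api.example-captcha.com/submit'); // hypothetical
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => [
            'key'  => $apiKey,
            'file' => new CURLFile($imagePath),
        ],
    ]);
    $taskId = curl_exec($ch);
    curl_close($ch);

    // Most services take a few seconds; poll until the answer is ready.
    for ($i = 0; $i < 20; $i++) {
        sleep(3);
        $answer = file_get_contents(
            "https://api.example-captcha.com/result?key=$apiKey&id=$taskId" // hypothetical
        );
        if ($answer !== false && $answer !== 'NOT_READY') {
            return $answer;
        }
    }
    return null; // give up; retry with a fresh session/proxy
}
```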

3) Ongoing maintenance

It is simply the nature of scraping. You cannot build a solution that will keep working as intended for long periods of time without Amazon making some tiny change that breaks your scraper. Unlike APIs, which stay the same for long periods of time, scraping requires consistent monitoring and maintenance to keep your solution performing as desired.
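
One cheap way to catch those tiny changes early is a canary check that runs on a schedule, fetches a known page, and verifies the markers your parser depends on are still present. The XPath below is only an example of such a marker, not a stable Amazon selector.

```php
<?php
// Canary check: fetch a known-good page and make sure the element the
// parser depends on still exists. Run it on a schedule and alert on failure.
function scraperStillHealthy(string $html): bool
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings from real-world markup
    $xpath = new DOMXPath($dom);

    // Example marker only; pick whatever your parser actually relies on.
    $nodes = $xpath->query('//div[@id="availability"]');
    return $nodes !== false && $nodes->length > 0;
}

$html = file_get_contents('known_good_product_page.html'); // saved sample or live fetch
if (!scraperStillHealthy($html)) {
    // Page layout changed (or we were blocked): stop the workers and alert.
    mail('dev@example.com', 'Scraper canary failed', 'Parser marker missing.');
}
```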

4) Don’t scrape more than you need to

A lot of people want to scrape things as fast as they can, as often as they can, just because they can. If there isn't a specific need for a particular piece of data, keep it simple and only scrape as much as you need.
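
A simple way to enforce that is a freshness gate that skips anything scraped more recently than it needs to be. The table and column names below are hypothetical, and queueScrapeJob() stands in for pushing to the SQS queue shown earlier.

```php
<?php
// Freshness gate: only queue items whose data is actually stale.
// Table/column names ("inventory_checks", "last_scraped_at",
// "check_interval_minutes") are hypothetical.
$pdo = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');

// Fast-moving listings might warrant hourly checks; slow ones daily.
$stmt = $pdo->prepare(
    'SELECT asin, seller_id FROM inventory_checks
     WHERE last_scraped_at < (NOW() - INTERVAL check_interval_minutes MINUTE)'
);
$stmt->execute();

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    queueScrapeJob($row['asin'], $row['seller_id']); // e.g. push to SQS as above
}
```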

5) Servers

You need servers to handle the requests, the data, the accompanying tasks, etc. I am not going to get into much detail here as every situation is different, but definitely factor in the cost of servers.

6) Initial development

Finally, you need a developer to string all of this together. Most new clients asking for scraping services are really only thinking about the cost to get the data mining operation going, but that is just one of the expenses. In my experience, the bulk of the cost comes from the items mentioned above.

Tags: amazon, api, aws, data mining, inventory, MWS, scraping

Amazon Bot Return Management & Automation

Posted on March 9, 2017 by Dustin Holdiman in Development Blog

I have recently finished work on a back-end service that is constantly running and checking for returns from Amazon, for over 30 seller accounts now, though it could easily handle 1,000. This bot monitors Amazon accounts for all removal requests that are made and then tracks those orders, collecting data on the products: descriptions, images, sale prices, and shipping information such as tracking numbers, which products will be in which shipment, and how many units will arrive. It helps a company manage products that were originally sent to Amazon as Prime-eligible items but did not sell or were returned or exchanged. When the seller has them returned, the bot finds all the products and gathers the details required to automatically track incoming packages, feeding the data into the company's warehouse software so the team can prepare space for incoming products en masse. Using the data collected, it then integrates with other software that automatically lists the products for sale on alternative sales channels, at prices believed to be ideal for moving the units. Manually handling what this bot does over and over, 24/7, without ever resting would take a small workforce. Instead, you have a computer running a background process, automatically interfacing with multiple platforms built in multiple programming languages and connecting a variety of systems, all working together toward a common goal.
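
For the curious, the removal data MWS hands back arrives as tab-delimited flat files, and turning one into rows for a warehouse system looks roughly like this. The column names (tracking-number, sku, shipped-quantity) and the registerIncomingPackage() call are my assumptions for the example; verify the headers against the actual report your account returns.

```php
<?php
// Parse an MWS removal report (tab-delimited flat file) into associative rows.
// Column names below are assumptions; check them against the real headers.
function parseRemovalReport(string $path): array
{
    $fh = fopen($path, 'r');
    $headers = fgetcsv($fh, 0, "\t"); // first line holds the column names
    $rows = [];

    while (($fields = fgetcsv($fh, 0, "\t")) !== false) {
        $rows[] = array_combine($headers, $fields);
    }
    fclose($fh);
    return $rows;
}

// Feed each shipment line into the warehouse software.
foreach (parseRemovalReport('removal_shipment_detail.txt') as $row) {
    registerIncomingPackage(            // hypothetical warehouse API
        $row['tracking-number'],
        $row['sku'],
        (int) $row['shipped-quantity']
    );
}
```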

Most of my scrapers, bots, data miners and such I can't really talk about due to the confidentiality agreements I have with the companies I build them for. I really wish I could talk about some of them, because some are just mind-blowingly awesome. This one I can talk about, so I figured I would post about it, as I am trying to update my blog at least once weekly. As a developer, I am trying to get in the habit of blogging about problems and scenarios I run into working as a freelancer, in the hope that some of the information will prove informative and even helpful to someone at some point.

I don't think many people understand the value of incorporating automation into their business models. It is a shame that business owners and management aren't aware of the benefits that come with having a software developer analyze their day-to-day operations and create custom software to reduce the cost of labor and increase efficiency. Most of the smaller companies I work with (50 or fewer employees) tend to miss a lot of these opportunities, even though they are the ones who most need to keep operating expenses at a minimum so they can focus on expanding instead of spending resources on staff performing the simplest of repetitive, time-intensive tasks. The week after next I will start working with a company that has been around for 14 years, helping them retool their entire standard operating procedure. After 14 years, many processes have become inefficient and workarounds have turned into habits; at the end of the day you have a system that works, but not as efficiently as it could. Currently, 16 employees handle data entry tasks 8 hours a day, 5 days a week. I will write up weekly reports over the coming weeks explaining my initial findings, the recommended course of action, the priority of the areas we intend to restructure, any 3rd party services we intend to use, and an overall assessment of productivity per dollar spent by the employer, and then compare the results after the overhaul. I think this will be an interesting case study that will help anyone who reads it see the benefits of automation. I am looking forward to using PHP, C#, and Python on this project, so look forward to a diverse set of code snippets and ideas put to practical use, and their effects on a business in real time.

I was going to post some MWS API PHP code, but in its current state it needs more comments before I feel comfortable showing it to others. I will update this post when I am finished with the additional documentation. I think a lot of people will find it useful for interacting with the MWS API; it also shows how the process of requesting and pulling reports works, and how best to avoid throttle limits. Until then, happy coding everyone!
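
Until that fully documented version is ready, here is the rough shape of the request/poll/fetch cycle. The $client here is a hypothetical thin wrapper around Amazon's MarketplaceWebService PHP client (its method names mirror the MWS operations, but the wrapper itself is mine), and the report type enumeration should be double-checked against the MWS reference.

```php
<?php
// Rough shape of the MWS report cycle: RequestReport -> poll
// GetReportRequestList -> GetReport. $client is a hypothetical thin wrapper
// around Amazon's MarketplaceWebService PHP client.

// 1) Ask MWS to generate the report (enum per the MWS docs; verify it).
$requestId = $client->requestReport('_GET_FBA_FULFILLMENT_REMOVAL_SHIPMENT_DETAIL_DATA_');

// 2) Poll until the report is done. MWS throttles these calls, so back off
//    between attempts instead of hammering the endpoint.
$delay = 60; // seconds; report generation routinely takes minutes
do {
    sleep($delay);
    $status = $client->getReportRequestStatus($requestId);
    $delay = min($delay * 2, 900); // exponential backoff, capped at 15 min
} while ($status !== '_DONE_');

// 3) Download the finished report and hand it to the parser shown earlier.
$reportId = $client->getGeneratedReportId($requestId);
file_put_contents('removal_shipment_detail.txt', $client->getReport($reportId));
```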

Tags: automation, bots, c#, data collection, data mining, OCR, parsing, php, python, web scraping
