22 data science systems used by Amazon to operate its business

Most large organizations use, or should use, most of these systems to optimize their business.

Here's my list:

  1. Supply chain optimization (I). Site selection for warehouses to minimize distribution costs (proximity to vendors, balanced against proximity to consumers). How many warehouses are needed, and what capacity each of them should have.

  2. Supply chain optimization (II). Selection of optimal routes, schedules, and product groupings, to minimize delivery costs (using graph theory).

  3. Supply chain optimization (III). Optimization of gas purchases for delivery trucks.

  4. Supply chain optimization (IV). Minimize the time drivers spend in traffic jams (requires traffic prediction) while optimizing delivery speed, gas usage, and other factors (is it better to be stuck for 20 minutes in a traffic jam, to take a costly detour, or to depart later?).

  5. Pricing and profit optimization (per-product price elasticity studies needed; may require products to be aggregated in categories, to create buckets that yield statistical significance)

  6. Fraud detection for credit card transactions (using decision tree methods). Also detect criminal behavior taking place on AWS, for instance, as well as system intrusions and hacking attempts (to prevent theft of data such as credit card numbers, employee ID theft, or other malicious activity).

  7. Fake review detection. Amazon still has a lot of progress to make in this area. Categorizing users would be a first step, so that buyers know what kind of user wrote a given review. Relevancy algorithms should then assess how relevant a review is for a specific product, knowing that most likes and stars assigned by users are biased, partly because most ordinary people have neither the time nor the interest to write a review. Indeed, fake reviews are a lucrative business that takes advantage of inefficiencies in platforms such as Amazon. The best solution is to remove user-generated reviews and replace them, for each product, with the number of sales over the last 30 days.

  8. Taxonomy creation to categorize products, produce and maintain great catalogs, and help with user searches: this is a gigantic clustering problem, which can be handled efficiently using tagging and indexing algorithms

  9. Smart search engine technology (based also on taxonomy discussed above) to help users find what they want to buy quickly

  10. Multivariate testing, for instance to find out which version of a search engine increases sales, everything else being constant

  11. Recommendation engine (and detection of artificial purchases aimed at fooling these algorithms)

  12. Customer segmentation, churn analysis, using survival analysis models, to increase marketing and advertising efficiency

  13. Advertising optimization, including automated bidding on Google AdWords for millions of keywords in real time, most having no historical data (use bucketization techniques to group keywords into buckets that have real predictive power); algorithms to identify millions of keywords worth purchasing, based on expected yield. Advertising mix optimization and attribution modeling. SEO and SEM.

  14. Inventory forecasting: how many units of each product should be kept at any time in any warehouse, to optimize a few metrics (profit, product decay if perishable, delivery time, etc.)

  15. Sales forecasting broken down by category / location based on tons of factors that need to be identified first, using feature selection algorithms (including economic forecasts; requires time series techniques)

  16. HR analytics: who to hire, how to score candidates to better predict who will succeed; detect employees at risk of leaving or committing fraud; optimize purchase of office supplies; optimize employee compensation given several market constraints; optimize travel expenses

  17. Real estate analytics

  18. Software/hardware system analytics: predicting and minimizing server crashes, optimizing redundancy under budget constraints, optimizing load balancing and bandwidth usage; how many servers must be purchased, and how frequently they should be replaced. Also, create email alert systems that automatically prioritize messages and select recipients. Also manage external email campaigns (delivery rate, open and click rate optimization).

  19. Payments analytics. Optimization of payments: to authors, vendors, publishers, while maximizing profits and minimizing publisher / author / vendor churn; vendor and publisher selection algorithms.

  20. Competitive analysis: automatically process billions of comments posted by users on social networks about Amazon, its competitors, and new trends; summarize this data, and take actions based on the insights derived from these daily / hourly / real-time automated analyses

  21. Tax engineering

  22. Ad relevancy algorithm to select and rank the ads displayed on a particular webpage to a particular visitor, to maximize some yield metric (such as click-through rate).
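
To make item 5 a bit more concrete, here is a minimal sketch of a price elasticity study with products pooled into categories. It assumes a constant-elasticity (log-log) demand model, which is only one possible choice, and it runs on synthetic data generated inside the script. The category names, prices, and elasticity values are all made up for illustration and say nothing about Amazon's actual figures or methods.

```python
import math
from collections import defaultdict

def price_elasticity(points):
    """Least-squares slope of log(units) on log(price) for (price, units) pairs."""
    xs = [math.log(price) for price, _ in points]
    ys = [math.log(units) for _, units in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic observations (category, price, units sold), generated from
# units = 1000 * price ** elasticity. All values are hypothetical.
observations = []
for category, true_elasticity in (("books", -1.5), ("electronics", -0.8)):
    for price in (5.0, 10.0, 20.0, 40.0):
        observations.append((category, price, 1000.0 * price ** true_elasticity))

# Pool observations into category buckets so that each bucket has enough
# data points to support an estimate.
buckets = defaultdict(list)
for category, price, units in observations:
    buckets[category].append((price, units))

elasticities = {category: price_elasticity(points)
                for category, points in buckets.items()}

for category, value in sorted(elasticities.items()):
    print(f"{category}: estimated elasticity {value:.2f}")
```

Because the synthetic data has no noise, the estimates simply recover the elasticities used to generate it. With real sales data, each bucket would also need a confidence interval before acting on the estimate.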

Data scientists, sometimes entire teams, are devoted to each of these problems.

Note that a number of these problems, typically the supply chain optimization ones, are solved using the simplex algorithm. Supply chain optimization is a subset of operations research, which itself overlaps significantly with data science.
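
For readers who have not seen the simplex algorithm in action, here is a small sketch: a textbook tableau implementation applied to a toy shipping problem. The function name `simplex_max`, the profits, and the capacity numbers are all hypothetical. A production system would rely on an established solver rather than hand-written code like this.

```python
from fractions import Fraction

def simplex_max(c, A, b):
    """Maximize c.x subject to A x <= b and x >= 0, assuming every b[i] >= 0."""
    m, n = len(A), len(c)
    # Each constraint row is [A | slack identity | b]; the last row is [-c | 0 | 0].
    T = [[Fraction(v) for v in A[i]]
         + [Fraction(int(i == j)) for j in range(m)]
         + [Fraction(b[i])] for i in range(m)]
    T.append([Fraction(-v) for v in c] + [Fraction(0)] * (m + 1))
    basis = [n + i for i in range(m)]  # the slack variables start in the basis
    while True:
        # Bland's rule: enter the lowest-index column with a negative reduced cost.
        col = next((j for j in range(n + m) if T[-1][j] < 0), None)
        if col is None:
            break  # no improving direction left: the tableau is optimal
        rows = [i for i in range(m) if T[i][col] > 0]
        if not rows:
            raise ValueError("problem is unbounded")
        # Minimum ratio test, ties broken by the smallest basic variable index.
        row = min(rows, key=lambda i: (T[i][-1] / T[i][col], basis[i]))
        pivot = T[row][col]
        T[row] = [v / pivot for v in T[row]]
        for i in range(m + 1):
            if i != row and T[i][col] != 0:
                factor = T[i][col]
                T[i] = [a - factor * p for a, p in zip(T[i], T[row])]
        basis[row] = col
    x = [Fraction(0)] * n
    for i, var in enumerate(basis):
        if var < n:
            x[var] = T[i][-1]
    return x, T[-1][-1]

# Hypothetical toy problem: daily truckloads of product groups A and B.
profit_per_load = [3, 5]   # profit per truckload of A and of B
usage = [[1, 0],           # warehouse 1 can stage at most 4 loads of A
         [0, 2],           # each load of B takes 2 of 12 handling hours
         [3, 2]]           # truck-hours used per load, 18 available
limits = [4, 12, 18]

plan, profit = simplex_max(profit_per_load, usage, limits)
print("loads of A and B:", [str(v) for v in plan], "profit:", profit)
```

Exact fractions are used instead of floats so that the pivoting steps do not accumulate rounding error. This is affordable for a toy problem but too slow for problems of realistic size.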

Some retention / churn problems are solved using Markov chain models, a technique also used extensively by operations research professionals. Many of the very complex problems benefit from Monte Carlo simulations, another operations research technique.
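
The sketch below shows both ideas on a toy churn model: a three-state Markov chain whose churn probability is computed exactly, then estimated again by Monte Carlo simulation. The states and the monthly transition probabilities are invented for illustration; in practice they would be estimated from customer history.

```python
import random

STATES = ["active", "at_risk", "churned"]

# Hypothetical monthly transition probabilities; each row sums to 1.
P = [
    [0.85, 0.10, 0.05],  # from active
    [0.30, 0.50, 0.20],  # from at_risk
    [0.00, 0.00, 1.00],  # churned is an absorbing state
]

def step(dist):
    """Advance the state distribution by one month: new[j] = sum_i dist[i] * P[i][j]."""
    size = len(P)
    return [sum(dist[i] * P[i][j] for i in range(size)) for j in range(size)]

def exact_churn(months):
    """Probability that a customer who starts active has churned after `months`."""
    dist = [1.0, 0.0, 0.0]
    for _ in range(months):
        dist = step(dist)
    return dist[2]

def simulated_churn(months, customers, seed=0):
    """Monte Carlo estimate of the same probability from simulated customers."""
    rng = random.Random(seed)
    churned = 0
    for _ in range(customers):
        state = 0  # every simulated customer starts active
        for _ in range(months):
            state = rng.choices(range(len(STATES)), weights=P[state])[0]
        churned += state == 2
    return churned / customers

exact = exact_churn(12)
estimate = simulated_churn(12, customers=20000)
print(f"12-month churn: exact {exact:.3f}, simulated {estimate:.3f}")
```

For a chain this small the exact calculation is all that is needed. Simulation becomes useful when the model has features that are hard to express as a fixed transition matrix, such as promotions or seasonality.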

Some opinions expressed in this article may be those of a guest author and not necessarily Analytikus. Staff authors are listed
