Best Seller Product Calculation, from Business Problem to Machine Learning Approach

Muchid Ariyanto
5 min read · Dec 17, 2020

In an online business, especially e-commerce, it can be difficult for a product owner to decide which products are most suitable for the homepage, particularly because the products displayed there must be quality products.

As data analysts or data scientists, we are responsible for building the logic that determines which products deserve a place on the homepage. In this article, we will discuss one method for picking the best products using a machine learning approach.

“Sales rank is calculated based on all-time sales of an ASIN/Product where recent sales are weighted more than older sales.” — Amazon.

According to Amazon, the main factor in determining the best product is the number of sales the product gets. Based on that understanding, we will build a model that scores each product. The following variables are used to determine a product's score:

  • Number of sales in the last 1 day
  • Number of sales in the last 3 days
  • Number of sales in the last 7 days
  • Number of sales in the last 14 days
  • Number of sales in the last 21 days
  • Number of sales in the last 28 days
  • Number of all-time sales

From the variables above, we will look for a weight to act as the multiplier of each variable. The method used is simple linear regression, fitted one variable at a time against the number of sales today. Mathematically, this can be illustrated as follows:
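As a minimal sketch of the per-variable fit described above, the snippet below regresses the target-day sales on each variable separately and keeps the slope as that variable's weight. The feature names, the function name `fit_weights`, and the choice to fit an intercept and keep only the slope are assumptions made for illustration; the article does not state these details.

```python
import numpy as np

# Hypothetical names for the seven sales windows listed above.
FEATURES = ["sales_1d", "sales_3d", "sales_7d", "sales_14d",
            "sales_21d", "sales_28d", "sales_all_time"]

def fit_weights(X, y):
    """Fit one simple linear regression per column of X against y.

    X: shape (n_products, n_features), historical sales per window.
    y: shape (n_products,), number of sales on the target day.
    Returns the slope of each regression, used here as the weight.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    weights = []
    for j in range(X.shape[1]):
        # np.polyfit with deg=1 returns [slope, intercept].
        slope, _intercept = np.polyfit(X[:, j], y, deg=1)
        weights.append(slope)
    return np.array(weights)

# Tiny made-up example with two variables instead of seven.
X_demo = [[1, 2], [2, 4], [3, 6], [4, 8]]
y_demo = [3, 5, 7, 9]
w_demo = fit_weights(X_demo, y_demo)
```

Because each variable is fitted on its own, the weights are not adjusted for the overlap between windows (the 3-day count already contains the 1-day count), which is a property of this one-by-one approach rather than of the sketch.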

The same formula is applied one by one to the 7 independent variables defined above, which gives a weight for each of them. Combining these weights gives a formula for the score of each product, as follows:
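One plausible reading of this description is a weighted sum of the seven sales variables. The symbols below are assumptions introduced here: x_i is the i-th variable in the list above (1-day sales through all-time sales) and w_i is the weight obtained from its regression.

```latex
\text{score} = \sum_{i=1}^{7} w_i \, x_i
             = w_1 x_{1\text{d}} + w_2 x_{3\text{d}} + w_3 x_{7\text{d}}
             + w_4 x_{14\text{d}} + w_5 x_{21\text{d}} + w_6 x_{28\text{d}}
             + w_7 x_{\text{all}}
```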

The next problem is how to determine whether the score is good enough to be used for this logic. To check this, we observe the correlation between each product's score and its number of sales on that day. If today is August 10, 2020, the score uses only historical data from before August 10, 2020; for example, the number of sales on August 9, 2020, is the "last 1 day" value for the score on August 10, 2020.

The graph below shows the relationship between the product score on August 10, 2020, computed from historical data before that date, and the number of sales on that date, for products in the health category. It can be seen that the higher the score, the higher the sales.

Dummy Data for Correlation between Score Generated and Number of Sales in the Next Day
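The check described above can be sketched with a Pearson correlation between scores and next-day sales. The function name and the numbers are made up for illustration and are not the data behind the graph; the article does not say which correlation measure was used, so Pearson is an assumption.

```python
import numpy as np

def score_sales_correlation(scores, next_day_sales):
    """Pearson correlation between product scores and next-day sales."""
    scores = np.asarray(scores, dtype=float)
    next_day_sales = np.asarray(next_day_sales, dtype=float)
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
    return float(np.corrcoef(scores, next_day_sales)[0, 1])

# Made-up values, one entry per product.
dummy_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
dummy_sales = [2, 4, 5, 4, 6]
r = score_sales_correlation(dummy_scores, dummy_sales)
```

A value of r close to 1 would support using the score for ranking; a rank-based measure such as Spearman's could be a reasonable alternative, since the homepage only needs the ordering of products.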

What if two products have identical sales numbers? How does this logic determine which product is better?

In e-commerce, several attributes can be used, such as the number of reviews, star rating, and CVR (number of sales / number of impressions). We can add these attributes to the previous score as tie-breakers, so that products with identical sales still receive distinct scores.

The number of reviews ranges from 0 to infinity. If we use these numbers as they are, a product with a very large number of reviews could outrank products that would otherwise score higher on sales alone. For attributes with an unbounded range, we can apply a logarithmic function to compress the values, so that extreme numbers are handled properly. Many other functions can be used, one of which is the S-curve, which bounds the value between 0 and 1. The following is an illustration of the logarithmic function that can be used:

Dummy Data for Number of Reviews with Logarithmic Function
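A minimal sketch of the two transformations mentioned above. The exact form of the logarithm (base, offset) is not given in the article, so `log(1 + n)` is an assumption chosen so that 0 reviews maps to 0; the S-curve shown is the standard logistic function, with hypothetical `midpoint` and `steepness` parameters.

```python
import math

def log_reviews(n_reviews):
    """Compress a review count with log(1 + n); 0 reviews maps to 0."""
    return math.log1p(n_reviews)

def s_curve(x, midpoint=0.0, steepness=1.0):
    """Logistic S-curve, bounded strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))

# 10 reviews vs 10,000 reviews: a 1000x gap in raw counts
# shrinks to roughly a 4x gap after the log transform.
small = log_reviews(10)
large = log_reviews(10_000)
```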

Meanwhile, star rating has a fixed range from the start, so there is no need to standardize it again. Likewise, CVR is always <= 1, so we don't need to standardize it either.

Finally, these attributes (the logarithm of the number of reviews, star rating, and CVR) are weighted by a very small number so that they do not interfere with the main formula based on the number of sales.
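Putting the pieces together, the sketch below adds the tie-breaking attributes to the sales-based score with a very small multiplier. The function name, the single shared `tie_weight`, its default value of `1e-6`, and the `log(1 + n)` transform are all assumptions for illustration; the article only says the weight is "very small".

```python
import math

def product_score(sales_windows, weights, n_reviews, star_rating, cvr,
                  tie_weight=1e-6):
    """Weighted sales score plus a tiny tie-breaking term.

    sales_windows and weights are parallel sequences (one weight per
    sales variable). tie_weight is a hypothetical small constant.
    """
    if len(sales_windows) != len(weights):
        raise ValueError("sales_windows and weights must have equal length")
    sales_score = sum(w * x for w, x in zip(weights, sales_windows))
    tie_breaker = math.log1p(n_reviews) + star_rating + cvr
    return sales_score + tie_weight * tie_breaker

# Two products with identical sales but different review counts
# (made-up weights and only two sales windows, for brevity).
a = product_score([10, 15], [1.0, 0.5], n_reviews=20, star_rating=4, cvr=0.08)
b = product_score([10, 15], [1.0, 0.5], n_reviews=5, star_rating=4, cvr=0.08)
```

Here `a` ends up slightly above `b`, so the tie is broken, while the difference stays far smaller than the effect of a single additional sale under these weights.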

To make this easier to understand, suppose today is 10 August 2020 and product A has the following figures:

  • Has 10 sales on 9 August 2020
  • Has 15 sales in the last 3 days before 10 August 2020
  • Has 18 sales in the last 7 days before 10 August 2020
  • Has 20 sales in the last 14 days before 10 August 2020
  • Has 30 sales in the last 21 days before 10 August 2020
  • Has 31 sales in the last 28 days before 10 August 2020
  • Has 50 all-time sales in total
  • Has a total of 20 reviews
  • Has a star rating of 4
  • Has a CVR of 0.08

So the score for product A on 10 August is:
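Since the fitted weights are not listed here, the substitution can only be written symbolically. In the sketch below, w_1 to w_7 are the weights of the seven sales variables in the order listed, ε is the very small tie-breaking weight, and ln(1 + n) is an assumed form of the logarithmic function; all of these symbols are introduced for illustration.

```latex
\text{score}_A = 10\,w_1 + 15\,w_2 + 18\,w_3 + 20\,w_4 + 30\,w_5 + 31\,w_6 + 50\,w_7
               + \epsilon \,\bigl( \ln(1 + 20) + 4 + 0.08 \bigr)
```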

Special thanks to Zahra Alya, who reviewed the writing in this article!