The science behind Amazon’s A9 algorithm disambiguated
In 2016, Daria Sorokina and Erick Cantu-Paz published the paper Amazon Search: The Joy of Ranking Products. It has since served as the definitive resource through which a layperson can understand how Amazon's relevance engine, and by extension the ranking algorithm, works.
We hold that this resource still accurately describes the functions and mechanisms of Amazon's ranking technology, despite its age. Since then, only two other relevant papers have been published (Multi-Objective Relevance Ranking and Multi-Objective Ranking with Constrained Optimization).
Both of those papers merely build on the work from The Joy of Ranking by optimizing the algorithm's performance under multi-objective constraints (in short, some categories and some buyer behaviors must be treated differently because their objectives differ). That work involves tuning the gradient-boosted trees in the relevance algorithms, which is a topic for machine learning scientists, not Amazon sellers.
As such, we wish to disambiguate the topic as presented in Sorokina's paper, and to discuss its implications from the perspective of sellers with decades of combined experience.
Part 1: How the Algorithm is “Programmed”
There are actually a number of different algorithms that work in tandem with one another to create the rich complexity that is relevance and ranking on Amazon.
These algorithms are "programmed," or rather "trained," using customer behavior as well as internal metrics. Essentially, they are machine learning algorithms fed mountains of historical data, from which they are able to predict future actions.
Ranking models are what present the sorted results when a shopper enters a query on Amazon. There is one model per category per marketplace, more than 100 machine-learned models in total, each optimized with "gradient boosted trees." Each model has 200 trees.
The mechanism that facilitates ranking is the gradient boosted trees, but for ranking problems a pairwise objective is used: it takes the relevance scores of a pair of listings and compares them, learning which of the two should appear first.
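As an illustration, the pairwise objective can be sketched in a few lines. The paper does not publish Amazon's exact formulation, so this uses the common RankNet-style logistic form purely as an assumed stand-in:

```python
import math

def pairwise_rank_probability(score_a: float, score_b: float) -> float:
    """RankNet-style pairwise objective (an assumed stand-in for the
    unpublished internal formula): the probability that listing A should
    rank above listing B, given their relevance scores. Training nudges
    this probability toward observed preferences, e.g. A was purchased
    while B was merely shown."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# A listing with the higher relevance score in a pair is preferred.
preference = pairwise_rank_probability(2.0, 0.5)  # greater than 0.5
```

The point is not the exact formula but the shape of the problem: the model never scores a listing in isolation; it learns from comparisons between pairs.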
All of that to illustrate that the relevance engine is robust and complex, accounting for as many variables and behaviors as possible.
Ranking models are trained using behavioral features, which are customer actions. These features largely determine how the rank sort is presented on a browse page.
Aside from customer behavior, another factor in training the ranking models is internal metrics such as conversion rate, basket size, and revenue. These are the hard numbers that prove what worked (or didn't) and steer the models toward what benefits Amazon's bottom line.
It has often been wondered whether price or revenue has any impact on rank; after this paper, it is not only clear but explicitly stated. Amazon absolutely tailors the customer experience to also benefit revenue (to no one's surprise).
One of the most fascinating aspects of how Amazon determines relevance is its use of behavioral cues. As the largest online retailer in the US, driving record sales by inventing a shopping holiday (Prime Day) and building the most seamless checkout experience in existence, Amazon has the most robust data set on customer behavior there is.
It would be foolish to not use that data to optimize the shopping experience, and therefore optimize the number of checkouts the marketplace gets.
As such, the ranking models are trained with features such as query specificity (the specific searched query and any signals of shopper intent) as well as "customer status" (seemingly a reference to a customer trust score).
Behavioral features weight relevance much more heavily on Amazon than in web search because several listings on the platform can have similar descriptions that fit the same query, yet some listings will still be more popular than others. The logic is that those more popular items "should be ranked higher." Only by analyzing behavioral actions can the rank sort deliver the most appropriate rankings for both the shopper and Amazon.
Other mechanisms of the “relevance engine,” as we like to call it, are:
- Click-through rates capture both position bias (accounting for higher positions gathering more clicks by default) and the typical relevance at that position. The bias correction is also adapted day to day.
- To separate product type from modifiers in queries, each query is treated as a noun phrase, with the head of the phrase treated as the product type and all other words treated as modifiers. This is largely done through product types detected in the queries and in product descriptions (which is a primary reason listing optimization is critical).
- Behavioral features can also shift the models from category to category (as fashion shoppers react differently to the search experience than kitchen shoppers, for example).
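To make the position-bias point concrete, here is a minimal sketch. The baseline click-through rates are invented for illustration; the paper states only that position bias is modeled and that the correction is refreshed daily:

```python
# Hypothetical baseline click-through rates by search position. Higher
# slots get more clicks regardless of relevance, so raw CTR must be
# normalized against its position before feeding a ranking model.
BASELINE_CTR_BY_POSITION = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}

def bias_corrected_ctr(clicks: int, impressions: int, position: int) -> float:
    """Observed CTR divided by the expected CTR at that position.
    A value above 1.0 means the listing out-performs its slot;
    below 1.0 means it under-performs even after its placement."""
    if impressions == 0:
        return 0.0
    observed = clicks / impressions
    return observed / BASELINE_CTR_BY_POSITION[position]
```

Under this kind of normalization, a flood of automated clicks at a given position simply shifts the listing's observed CTR away from the position's expected profile, which is one reason click manipulation is hard.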
*A quick note on general search:
Ranking models provide search order within categories. Amazon recognizes, however, that most customers use the "search all" feature on the homepage, so it has created a blended scoring mechanism to offer results from the multiple categories that may be relevant. Results are presented based on the probability that the query was intended for each category, and 90-day click data is used to estimate these probabilities.
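A toy sketch of such blended scoring follows. The probabilities and scores are invented, and the blend itself (a simple product of within-category relevance and category probability) is our assumption; the paper says only that category probabilities come from 90-day click data:

```python
def blend_results(category_results: dict, category_prob: dict, top_k: int = 5):
    """Blend per-category ranked results into one list by weighting each
    listing's within-category relevance score with the probability (from
    click history) that the query was intended for that category."""
    scored = []
    for category, results in category_results.items():
        for asin, relevance in results:
            scored.append((relevance * category_prob.get(category, 0.0), asin))
    scored.sort(reverse=True)  # highest blended score first
    return [asin for _, asin in scored[:top_k]]

# A query whose click history leans 70/30 toward Books over Music:
blended = blend_results(
    {"Books": [("b1", 0.9), ("b2", 0.5)], "Music": [("m1", 0.8)]},
    {"Books": 0.7, "Music": 0.3},
)
```

Note how a strong Music result can still lose to a middling Books result when the query's click history leans heavily toward Books.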
Part 2: Features and Labels
Counterintuitively, the most purchased product for a query isn't necessarily the most relevant to that query. The example given in the paper is someone searching for the term "diamond earrings."
The number one seller is often a cheap cubic zirconia version of the earrings; however, it would not be acceptable to present it as the top option in a search for diamond earrings. To combat this, the relevance algorithms are given features and labels to tease out a customer's intention behind a query.
The ranking models start with 150 possible features, which are whittled down to just 50 by keeping only those that score above random. From there, the 20 most applicable are used to create the final rank sort for a query.
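That two-stage whittling can be sketched as follows. The scoring metric and the random baseline are assumptions on our part; the paper says only that features must score above random before the strongest survivors are kept:

```python
def select_features(feature_scores: dict, random_baseline: float, keep_top: int):
    """Two-stage feature selection: first drop any feature that does not
    score above a random baseline, then keep only the strongest
    `keep_top` of the survivors (e.g. 150 -> 50 -> 20)."""
    above_random = {f: s for f, s in feature_scores.items() if s > random_baseline}
    ranked = sorted(above_random, key=above_random.get, reverse=True)
    return ranked[:keep_top]

# Hypothetical scores: "noise" fails the random baseline and is dropped
# in stage one; the top 2 of the remainder survive stage two.
picked = select_features(
    {"clicks": 0.9, "noise": 0.01, "price": 0.4, "color": 0.2},
    random_baseline=0.05,
    keep_top=2,
)
```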
An example of a feature given in the paper is “days_since_release” which is only applicable to movies. Undoubtedly this feature is used to help determine whether a query refers to an older or newer version, or whether the movie exists yet at all.
Another example is "is_prime_benefit," which indicates whether a video is available as a benefit for a Prime user (and thus would be relevant to certain searches and rank higher in results, as it enhances the user experience).
Other features include behavioral actions such as clicks, purchases, and add-to-carts. These are all weighted, but differently per model (by category, known user intent, etc.), because shoppers in some categories act differently than shoppers in others.
A good example of this is a search for the query “iPhone.” The number one clicked result is the latest iPhone, yet that is often not the most purchased. Why? Because it is at its highest price and many shoppers will wait for a price reduction. Actually, the most purchased product for that query is the lightning cable. This is why not every model can be trained to prioritize purchases.
To assist with feature selection, data logs of customer actions are used and "labels" are applied to listings. These labels are binary, providing only positive or negative signals: positive labels are based on action, whereas negative labels are based on inaction.
For example, a shopper submits a query and clicks on a listing. That listing gets a positive label for the click. The other listings that received a page impression (the customer could visibly see them on the search page) but did NOT get a click receive a negative label.
Listings that were added to the cart also get positive labels, whereas listings that were clicked but not added to the cart receive a negative label. This is how behavior is sorted and tracked so that the proper features are selected and each category of customer receives the expected search experience.
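The labeling scheme just described can be sketched as a small function. The event format here is invented for illustration; only the positive/negative logic comes from the paper:

```python
def label_search_events(events):
    """Assign binary labels from a search log. An action yields a
    positive label (1); visible inaction yields a negative label (0):
    an impression with no click is a negative click label, and a click
    with no add-to-cart is a negative add-to-cart label. Where no
    signal exists at all, no label is assigned (None)."""
    labels = {}
    for asin, impressed, clicked, added in events:
        labels[asin] = {
            "click": 1 if clicked else (0 if impressed else None),
            "add_to_cart": 1 if added else (0 if clicked else None),
        }
    return labels

# a1 was clicked but not carted; a2 was seen but never clicked.
labels = label_search_events([
    ("a1", True, True, False),
    ("a2", True, False, False),
])
```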
Part 3: Match Sets and Cold Start
Query results are delivered by presenting two types of matches: standard and behavioral. Standard matches are textual matches; behavioral matches are based on products shoppers interacted with in past queries.
The example given in Sorokina's paper was a search for a movie DVD before its release. Obviously that DVD cannot be presented, since it doesn't exist yet. When shoppers click other movie titles within the same session, those clicks are used to build behavioral models for future query results.
In the example above, the different sets of products shown for a query are the "match sets." The behavioral results built specifically for queries with few or no relevant product matches are what's called the "cold start."
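The two match sets can be sketched as a union of textual and behavioral candidates. The data structures here are illustrative, not Amazon's:

```python
def build_match_set(query: str, catalog: dict, behavioral_history: dict):
    """Union of two match sets: standard (textual) matches, where every
    query term appears in the listing text, and behavioral matches,
    i.e. products shoppers engaged with after issuing the same query."""
    terms = set(query.lower().split())
    standard = {asin for asin, text in catalog.items()
                if terms <= set(text.lower().split())}
    behavioral = set(behavioral_history.get(query.lower(), []))
    return standard | behavioral

# "d3" has no textual match for the query but was clicked by past
# shoppers who searched it, so it enters via the behavioral set.
catalog = {"d1": "star voyage dvd", "d2": "garden hose"}
history = {"star voyage dvd": ["d3"]}
matches = build_match_set("star voyage DVD", catalog, history)
```

When the textual set is empty, as with an unreleased DVD, the behavioral set is all the engine has, which is exactly the cold-start situation.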
Cold start is what has become commonly known in the Amazon industry as the "honeymoon period." This is a period during which Amazon's relevance engine must draw solely on listing information and early customer clicks to determine relevance, before it has the data it needs.
Every new listing has a cold start, but the amount of time necessary to build relevance for specific keywords will vary based on existing behavioral models. The literature does suggest, however, that seven days is typically the timeframe the algorithms need to fully assess data and determine relevance for most keywords.
Part 4: Implications for Sellers
To recap, listing sort order, otherwise known as keyword ranking, on Amazon is determined by relevance. Relevance is determined by ranking models. Ranking models are determined per category per marketplace by behavioral features, customer status, and listing data. Features are determined by behavior and labels.
That details exactly how the ranking algorithm works, but how can that information be used?
Well, first it is important to recognize what won't work. Due to the bias correction, it is much harder to manipulate rank position with automated clicks. Due to the behavioral analysis, it would also be difficult to manipulate rank with automated actions such as add-to-carts.
This is because every click, add-to-cart, purchase, or other action is modeled, and an expectation is set for every category (and even some sub-categories). And finally, because customer status is a feature, fake orders and poor customer pools will be ineffective as well.
However, note that the paper specifically outlines that the ranking models detect product types from listing information. This means that what WILL work is stellar listing optimization. To further that point, positive labels are attributed to listings that get clicks and other actions, and those actions increase with optimized imagery.
Essentially, all of the positive features needed for relevance can be increased with imagery and SEO focused copy.
Aside from optimization, reading between the lines we can also conclude that being direct and unambiguous with keyword selection is advantageous. Basically, the more clearly a keyword represents your product within a category, the easier it is to gain relevance for that keyword.
Lastly, it is good to know the expected behavior in your product's category. For example, if you sell an expensive product, you should know whether there are keywords people use purely for research versus keywords that carry purchase intent.
And of course, always keep in mind that the more products you can get customers to add to their baskets and the more revenue you can earn for Amazon, the more likely you are to win favor with the algorithms, all other things being equal.
If you need help with crafting the most optimized listing for both Amazon’s relevance engine and your prospective customers, reach out to us. Signalytics accesses the most robust data tools to ensure greater conversions and profits with incredible optimizations.