7 Recommendation Systems

The application we will be referring to can be found here: https://datasupport.site

7.1 Recommending similar books: content-based filtering

First we will create the algorithm used to recommend ‘similar books’ to the one the user just bought. That algorithm simply looks at how similar the book the user just bought is to the other books available, according to the given attributes (author, genre, age). This approach is called content-based filtering.

7.1.1 What is similarity?

For us humans, distinguishing similar items is a simple task. We look at the properties of the item (in this case the book’s author, theme and age) and we can group together books by the same author, the same theme and a similar age. If we want a machine to perform that task, we need to generalise this intuition and describe it using a set of clear instructions.

In machine learning, similarity is often described by how far two items are from each other. Distance is an effective metric as it is quantifiable and unambiguous. Here, that distance is measured in terms of the given attributes.

To understand this, we can imagine plotting all of our points, using their attributes as coordinates. The closer one point is to another, the more ‘similar’ they are. For example, suppose we only used the \(x\) axis (a straight line), whose values represent the age of each book. We could then mark where on that line each book falls according to its age. Marks that are close to each other would indicate books of similar age. To define ‘similarity’ in a more realistic manner we need to use the rest of the attributes, so instead of just a line we plot our points in a space where each axis (known as a dimension) represents an attribute.

This distance between points can be measured using various techniques such as:

  • Euclidean distance. Similar to computing distances in two dimensions, we use the coordinates of the points to draw a right-angled triangle with two known sides and compute the unknown side, which represents the distance, using the Pythagorean theorem.

  • Correlation distance. This essentially measures how ‘in sync’ the deviations from each user’s mean are (do they both go up or down together?). Users that have rated the same products higher than their own average, and the same products lower than their average, are considered close to each other.

  • Hamming distance. This simply measures the proportion of disagreement: how many times two users did not give the exact same rating. (A short sketch of each of these three distances in R follows this list.)
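
Below is a minimal sketch, using two made-up rating vectors (not part of the application code), of how each of these measures could be computed in R:

# Two hypothetical users' ratings for the same five books
ratingsA <- c(5, 3, 4, 4, 2)
ratingsB <- c(4, 3, 5, 3, 1)

# Euclidean distance: the Pythagorean theorem generalised to many dimensions
euclideanDist <- sqrt(sum((ratingsA - ratingsB)^2))

# Correlation distance: users whose deviations from their own mean move
# 'in sync' have a correlation close to 1, so their distance is close to 0
correlationDist <- 1 - cor(ratingsA, ratingsB)

# Hamming distance (as a proportion): how often the two users did not give
# the exact same rating
hammingDist <- mean(ratingsA != ratingsB)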

7.1.2 How can we find similar books?

As we discussed above, similarity is described in terms of the distance between points. So we would have to ‘plot books over the axes of age, author and theme’ and find which books are close to each other. Author and theme are, however, categorical variables (non-numeric), so we cannot easily plot them in the way we want to. A simple way to bypass this would be to assign a numeric ‘dummy’ variable to each value a categorical attribute takes on. For example, the theme love story would be \(1\), pirates \(2\), aliens \(3\), drama \(4\) and so on. Once this is complete we can plot the books in terms of theme using those dummy variables. Books with the same theme would be close to each other, correctly describing similarity. So what is the problem?

The problem is the ordering of the dummy variables. When we plot this, our model will decide that love story (number \(1\)) is more similar to pirates (number \(2\)), as they are closer to each other, than to, say, drama, which happens to be number \(4\). We would have to choose an ordering that makes logical sense, which can be exhausting. Sometimes such an ordering may not even be meaningful, or may be very challenging to define: in the example of authors, what would make one author ‘more similar’ to another?
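
To make this concrete, here is a tiny illustration in R (the numeric codes below are arbitrary, which is exactly the problem):

# Arbitrary numeric codes for each theme
themes <- c(love = 1, pirates = 2, aliens = 3, drama = 4)

# The model would see 'love' as more similar to 'pirates'...
abs(themes["love"] - themes["pirates"])  # 1
# ...than to 'drama', purely because of the numbering we happened to choose
abs(themes["love"] - themes["drama"])    # 3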

For that reason, we have chosen to work with the Hamming distance for the categorical variables (it simply checks whether or not two values are the same) and the absolute difference for the age. We have created our own algorithm, which follows our intuition of what similarity is; let’s see how it works!


# First we load the dataset I have prepared, it contains the following
# observations:
# Books, which contain the book number (from 1 to 20), author (a, b, c, d), the
# genre (pirate, alien, love, drama) and how many years ago the book was written
books <- read.csv("books.txt")

# For every book that a user buys, we want to recommend 'similar' ones. So every
# time the user clicks 'Submit' to buy a specific book, we want to calculate the
# distances from that book to the rest of the books available and recommend the
# ones that are closest (efficiency considerations and optimisations are
# discussed later; for now let's keep things simple for the purpose of
# understanding the concepts).

# So the first step is to get the distances of this book to the rest. This is
# what this function does.

# It receives the book number of the book that was just bought as a parameter
get.distance <- function(bookNumber){
  # It returns a data frame containing various distance metrics
  df <- data.frame(CompBook=integer(),
                   hammingDist=integer(),
                   ageDist = integer(),
                   totalDist = numeric()
                   )
  # First we capture the author, genre and age of the book that was bought,
  # since those values will be used to compare similarity
  author <-books[books$book==bookNumber,]$author
  genre <- books[books$book==bookNumber,]$genre
  age <- books[books$book==bookNumber,]$age


  # Then for every other book available we
  for (i in 1:20){
  # Capture its author, genre and age
  authorComp  <-books[books$book==i,]$author
  genreComp  <- books[books$book==i,]$genre
  ageComp  <- books[books$book==i,]$age

  # Reset our distance metrics to zero
  hammingDistance <- 0
  ageDist <- 0
  totalDist <-0

  # Compute our Hamming distance
  # If they do not have the same author, we want the distance to increase so we
  # add one
  if (author != authorComp){
    hammingDistance = hammingDistance+1
  }
  # If they do not have the same genre, we want the distance to increase so we
  # add one more
  if (genre != genreComp){
    hammingDistance = hammingDistance+1
  }

  # For the age, we simply get the absolute difference between the age of the
  # book we just bought and the age of the book we are comparing it to
  ageDist <- abs(age - ageComp)

  # It would not make sense to simply add our Hamming distance to the age
  # distance, as the Hamming distance can only go up by 1 for each attribute,
  # whereas the age difference can reach very high values. Age would become far
  # more significant than the rest of the attributes. Instead we use domain
  # knowledge to interpret this value.

  if (ageDist > 10) {
  # If the age difference is greater than 10 years we add 0.5 to the distance
  # (we assume this threshold has been chosen from domain knowledge and the
  # studies available)
    totalDist <- hammingDistance + 0.5

  # If it is only greater than 5 we add 0.25
  } else if (ageDist > 5){
    totalDist <- hammingDistance + 0.25

  # Otherwise total distance is just the one we have previously calculated, the
  # hamming distance
  } else {
    totalDist <-hammingDistance
  }

  # We just need to add our calculated metrics to the data frame we need to
  # return
  newdata <- data.frame(CompBook = i,
                        hammingDist = hammingDistance,
                        ageDist = ageDist,
                        totalDist = totalDist)
  df <- rbind(df, newdata)

  }

  # And return them :)
  return(df)

}

# This function is the one actually called when the user hits submit
getClosest3 <- function(bookNumber){

  # It makes use of the previous function to get the calculated distances from
  # all our points
  distDataSet <- get.distance(bookNumber)

  # It removes the distance of our point to itself, as we do not want to
  # recommend to the user the book they just bought (although technically it
  # has the smallest distance)
  distDataSet <- distDataSet[! (distDataSet$CompBook==bookNumber),]

  # We order the resulting distances in ascending order (smallest first)
  orderedDistances <- distDataSet[order(distDataSet$totalDist,
                                        decreasing=FALSE), ]

  # Choose the first 3 rows from the ordered data frame
  closest <- head(orderedDistances, 3)

  # and return them! :)
  return(closest$CompBook)
}
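
# A quick usage example (assuming books.txt has been loaded as above; the book
# number 7 is just an illustrative choice): this returns the numbers of the
# three books closest to book 7
getClosest3(7)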

7.2 Recommending books that were liked by ‘similar’ users: collaborative filtering

7.2.1 Similar users?

Similar users are simply users that have provided similar ratings or feedback, purchased or previewed similar items, and so on. In our case, we will focus on users that have provided similar ratings. Although our recommendations improve with more and more data, this does not mean that we require every user to have rated every book for the algorithm to work.

The idea here is that we take the rating matrix (see image 2; it is simply a table presenting the rating every user gave to every book) and try to fill in the gaps left by missing ratings. Our model assumes that if a rating is missing, that user has not read the respective book, so it tries to predict what their rating would have been if they had read it. If that prediction turns out to be favourable, the model will recommend that book to that user.

The model fills those gaps by looking at the ratings of other users that have given similar ratings. Again, similarity can be measured by the distance of one user to another, where the coordinates are the ratings each user gave to each book.

Let’s use an example to understand this. We want to recommend some books to reader A. Reader A has already read a couple of our books and has rated them, but has not rated book 1 and book 10, so we assume those have not been read and we want to see if it is worth recommending them.

First we find some other readers with ratings similar to reader A’s (from the rating matrix). Then we look at the ratings for book 1 and book 10 that were given by that group of similar readers. For each book we take the average rating they gave; better still, we can take a weighted average, weighted by how close each of those readers is to A (the closer, the more significant). Say we find that the average rating given by readers close to A is 2 for book 1 and 5 for book 10; we then recommend book 10 to reader A.
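
As a small sketch of that weighted-average step (the ratings and distances below are made up purely for illustration):

# Ratings that the three readers closest to reader A gave to book 10, together
# with their distances to reader A (closer readers should count for more)
neighbourRatings <- c(5, 4, 5)
neighbourDistances <- c(0.5, 1.0, 2.0)

# Use the inverse of the distance as the weight
weights <- 1 / neighbourDistances
predictedRating <- sum(weights * neighbourRatings) / sum(weights)
predictedRating  # about 4.7, so book 10 looks worth recommending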

The main advantage of this approach is that we do not require any information about the books themselves. This can be extremely valuable in cases where acquiring that information is laborious, and maybe even expensive and slow; describing each item often requires manual work by humans who are experts in the relevant field. The disadvantage is that we need feedback from the users in order to recommend items. Not all users give feedback, and users often give feedback only in extreme cases (very good or very bad), so the model misses out on valuable insight for a lot of products.


# We load the data, this is a list that I have prepared, it contains the user,
# the book and the rating per row.
# Each user has rated some of our books from 1 to 5 (where 5 is perfect)
# Not every user has rated every book
readersFeedback <- read.csv("readersFeedback.txt")

# From that data, we need to create the rating matrix as shown in the image. At
# the moment we have normalised data (user, book, rating), so we have from 1 to
# 20 entries for each user. This is not useful if we want to find the distance
# between users using their ratings: the ratings are the attributes that need to
# be used as coordinates to plot each point. So we need to transform the data
# into a data set that has one entry for each user and 20 columns containing
# ratings, one for each book.

# First we make the 20 vectors containing the ratings for each user, those will
# become columns
for (i in 1:20){
  assign(paste("book",i,sep=""), readersFeedback[readersFeedback$book==i,]$rate)
}
# Also make a vector containing the reader numbers in order
readers <- c(1:15)

# Now we make the new data set
readers.data <- data.frame(readers, book1, book2, book3, book4, book5, book6,
                           book7, book8, book9, book10, book11, book12, book13,
                           book14, book15, book16, book17, book18, book19,
                           book20)

# This library contains the function knnImputation, which can fill in the
# missing ratings, using the process we discussed previously.
install.packages("DMwR")
library(DMwR)

# This is the function that is called every time a user submits a new rating on
# our website.
# It receives all the ratings as parameters (some may be null and this is fine)
getRecommendations <- function(book1, book2, book3, book4, book5, book6, book7,
                               book8, book9, book10, book11, book12, book13,
                               book14, book15, book16, book17, book18, book19,
                               book20) {

  # Since we don't have a log-in page, we don't know who the user is. For now
  # we can just assume these ratings are coming from a new user.
  # So we need to add this new user number to our rating matrix
  newUserId <- nrow(readers.data) + 1

  # We capture their ratings in this vector
  newEntry <- c(newUserId, book1, book2, book3, book4, book5, book6, book7,
                book8, book9, book10, book11, book12, book13, book14, book15,
                book16, book17, book18, book19, book20)

  # Then we add their ratings and reader number to the existing matrix
  newMatrix <- rbind(readers.data,newEntry)

  # We fill in the gaps from missing ratings, as explained before using the
  # function provided by the library
  completeMatrix <- knnImputation(newMatrix,
                                  k = 3,
                                  scale = T,
                                  meth = "weighAvg",
                                  distData = NULL)

  # Capture the user's ratings after we have filled in the gaps
  readersRatingComplete <- completeMatrix[completeMatrix$readers == newUserId,]
  # Capture the books that the user has not rated
  readersRatingsMissing <- which(is.na(newEntry))

  # Recommend books that the user has not rated and for which the predicted
  # rating is more than 3.4.
  recommended <- integer()
  # Column 1 is the reader number, so columns 2 to 21 hold books 1 to 20
  for (i in 2:21){
    if (i %in% readersRatingsMissing){
      if (readersRatingComplete[, i] > 3.4){
        # Store the book number (the column index minus one)
        recommended <- c(recommended, i - 1)
      }
    }
  }

  # Return the books that satisfy that :)
  return(recommended)

}
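
# A quick usage example with hypothetical ratings: NA marks the books this new
# user has not read. The call returns the numbers of the unread books whose
# predicted rating is above 3.4.
getRecommendations(5, NA, 3, NA, 4, NA, NA, 2, NA, NA,
                   NA, 5, NA, NA, 1, NA, NA, NA, 4, NA)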

7.3 Recommending items that are often bought together (mining item association rules)

We want to look at the transaction history and recommend items that users tend to buy together, or over a short period of time. This approach to recommendation systems is called association rule learning. Given that a user has just bought an item, we want to find out how likely they are to buy other ‘associated’ items and recommend the ones we are most confident the user will be interested in.

To understand how this works, let’s use an example where we want to find out whether we should recommend book 2 to a user that just bought book 1 (only by looking at the transaction history of our shop). In other words, we want to find out if the association rule ‘buy book 2 given that book 1 was bought’ is strong enough.

To measure the strength of ‘association’ between some items, we use the following metrics:

  • First we identify what proportion of all the transactions include both book 1 and book 2; this is called the SUPPORT of this rule. A high support means that there are a lot of transactions where both items have been bought together.

  • This is some evidence that there is a strong relationship. However, a high support on its own could simply reflect that book 1 is very popular and appears in a lot of transactions, some of which happen to include book 2 as well. We want to be able to deal with such cases, and this is what CONFIDENCE measures: how often books 1 and 2 are sold together relative to how often book 1 is sold. It is given by the ratio of the proportion of all transactions that include both 1 and 2 (the support) over the proportion of transactions that include book 1. It is nothing more than the conditional probability of buying 2 given 1.

  • Last, we want to find out how much the probability of a user buying book 2 increases once they have bought book 1. This is called LIFT. The original probability of buying book 2 is just the proportion of all transactions that include book 2, which also captures the case where book 2 simply happens to be very popular. The probability of buying 2 given that 1 was just bought is what we measured above, the confidence. If we divide the conditional probability of 2 given 1 (the confidence) by the original probability of buying 2, we obtain the lift. A lift greater than 1 means that the likelihood of the reader buying 2 has increased once they bought 1.

For an association rule to be considered we want it to have a high support, confidence and lift. A small worked example of the three metrics is given below.
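
Here is a small worked example in R with made-up counts, just to show the arithmetic behind the three metrics:

# Hypothetical counts from a transaction history
totalTransactions <- 100
withBook1 <- 20   # transactions containing book 1
withBook2 <- 25   # transactions containing book 2
withBoth  <- 15   # transactions containing both books

support    <- withBoth / totalTransactions                  # 0.15
confidence <- withBoth / withBook1                          # 0.75, i.e. P(2 given 1)
lift       <- confidence / (withBook2 / totalTransactions)  # 3, well above 1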

We now have an understanding of how we measure ‘association’ between products. But how do we go about selecting which products to measure that association for in the first place? One approach is to brute force it: measure the association for every possible pair of products, then every possible triple, and so on, up to every possible set of \(n\) products, and from all those measurements pick the associations with the highest strength. This is clearly computationally expensive.

There are various approaches to this problem; a popular one, which we will use, is the Apriori algorithm. This algorithm investigates associations incrementally: it starts from single items, then pairs, then triples, and so on. At every step it checks which itemsets satisfy a minimum support value, and only those are used to generate rules and to build the larger itemsets of the next step (any itemset that contains an infrequent itemset cannot itself be frequent, so it is safely discarded). It repeats until no larger itemsets can satisfy that minimum support threshold.


# This package will help us implement Apriori to mine association rules
install.packages("arules")
library(arules)

# Our first step is, as always, to load the data. I have prepared a transaction
# history of 43 transactions, in a normalised 'single' transactionId-item
# format. We want R to recognise the file as a transaction history, in order to
# be able to implement Apriori with ease. There are two formats that R can work
# with for transactions. One is single, the one we use (you can view the file
# for details), where each entry contains the transaction id and the item
# bought, if there are multiple items bought in the same transaction, there
# would be multiple entries with the same transaction id.
# The alternative is basket format, where each entry contains the id and all the
# products bought in that transaction in a single entry.
trans <- read.transactions("transactionsNoUserName.txt",
                           format = "single",
                           sep = ",",
                           cols = c(1,2))

# Now we can implement Apriori. We can set the thresholds for support and
# confidence as shown below:
assoc_rules <- apriori(trans,
                       parameter = list(sup = 0.03, conf = 0.6,target="rules"))
# We can inspect rules that were found
inspect(assoc_rules)
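
# Optionally we can narrow down the output. This is a sketch using the sort(),
# subset() and size() helpers provided by the arules package.
# Rules sorted by lift, strongest first:
inspect(sort(assoc_rules, by = "lift"))
# Only rules whose left-hand side is a single book, since in our application
# the user buys one book at a time:
inspect(subset(assoc_rules, subset = size(lhs) == 1))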


# Last, we only need to apply those rules. This is the function that is called
# every time a user submits a book purchase.
# Since the user is only buying one book, we only use the rules that are
# relevant to a single-item purchase.
useRules <- function(bookNumber){
  message("/rules")
  switch(
    toString(bookNumber),
    "1"=2,
    "3"=1,
    "4"=2,
    "12"=11,
    "16"=c(17,18),
    "17"=2,
    "18"=17,
    "19"=18,
    {0}
  )
}

7.4 Further discussions

7.4.1 Optimisations

Although our models are currently doing their job pretty well, they would not scale well. For example, if our users increased from 15 to 100,000 and our books from 20 to 100,000,000,000, we would be in a lot of trouble. Let’s see why…

At the moment, our content-based filtering algorithm requires all the attributes of the book that was just bought to be checked against the attributes of every book available. There could be countless books, and in a more complex scenario the attributes we compare would not be just three (we might include book title, popularity, publisher, city and so on). We might really need to check every attribute of the books we do compare, but we may be able to reduce the proportion of books we compute the precise distance to. For example, if a user just bought a book about aliens, is it really worth comparing that book to love stories? Could we, before applying the algorithm to every single book, filter out the ones that we think are completely irrelevant? Even better, could we build some pre-defined groups of books that share similar attributes and use them as a guide?
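
As a rough sketch of that filtering idea (reusing the books data frame from earlier; filtering on genre alone is just an assumption made for illustration):

# Return only the books that share the genre of the book just bought; the
# precise distances would then be computed only for these candidates
prefilterByGenre <- function(bookNumber){
  genre <- books[books$book == bookNumber, ]$genre
  candidates <- books[books$genre == genre & books$book != bookNumber, ]
  return(candidates$book)
}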

In the case of collaborative filtering, every time a user provides some ratings we currently need to update the rating matrix and then re-fill the gaps using the new information. That is very expensive and not maintainable in a system that receives hundreds of ratings per day. What we could do instead is store the new ratings in some temporary storage and update the permanent rating matrix in daily batches. Moreover, if we had thousands of books, predicting the rating of every book for every user may not be ideal, so we might consider some filtering here as well.

Generally, a lot of effective predictive systems employ a combination of methods. For example, we might cluster books into groups using content-based filtering; these could be very inclusive clusters (each would still include multiple books). We could then use collaborative filtering within each cluster to get more specific recommendations. This would save a lot of computational overhead, as the clusters could be pre-defined and the collaborative filtering would require only a portion of the original data (it would use a rating matrix that refers only to the books within the cluster). Another example would be to use both collaborative filtering and association rules in each case; we could then compare the recommendations returned by the two algorithms and only return the ones found by both.

A last note is to try to use pre-built packages and algorithms when appropriate, as they have been optimised for performance. Make sure, however, that you understand how those algorithms function, in order to be able to provide them with the most appropriate data and the correct parameters. Furthermore, you need to be able to make adjustments when required, interpret their outcomes and measure their performance as objectively as possible.

7.4.2 Alternatives

Another approach to optimisation is to look at alternative algorithms that may be more ‘accurate’, more efficient, or simply more appropriate for your case. As an example, let’s look at some alternative algorithms for collaborative filtering.

Previously we used the Euclidean method, through a library provided by R, to identify similar users through their ratings. There are other well-known methods for identifying ‘similar’ users, such as the Pearson method or the Jaccard method. In the Pearson method, each user’s series of ratings is first centred on that user’s average rating and then treated as a vector; the similarity of two users is the cosine of the angle between their vectors, computed from their dot product divided by the product of their magnitudes.

In the Jaccard similarity, the values of the ratings do not matter. Similar users are simply users that have read the same books, clicked on the same websites, seen the same movies, et cetera. This has some clear drawbacks (just because someone read a book does not necessarily mean they liked it), but it does not require users to have given explicit feedback.
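
A minimal sketch of these two measures, using made-up data:

# Pearson similarity of two users' ratings: centre each user's ratings on their
# own mean, then take the cosine of the angle between the two vectors (cor()
# performs exactly this calculation)
ratingsA <- c(5, 3, 4, 4, 2)
ratingsB <- c(4, 3, 5, 3, 1)
pearsonSim <- cor(ratingsA, ratingsB)

# Jaccard similarity: ignore the ratings and only compare the sets of books
# each reader has read
readA <- c(1, 2, 5, 7, 12)
readB <- c(2, 5, 9, 12)
jaccardSim <- length(intersect(readA, readB)) / length(union(readA, readB))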

Last, you might want to consider using the correlation distance (mentioned under content-based filtering) for something like ratings. In this case, association is described by whether or not the users have rated the same items above or below their own averages.

Up to now we have discussed collaborative filtering by looking at similar users. There is, however, an alternative approach, called latent factor analysis. This method attempts to identify the underlying factors that caused each user to rate each book the way they did; we then use those factors to extrapolate (predict) what their ratings would be for the rest of the books.

To achieve this, we need to assume that there is a factor (or combination of factors) for each reader that makes them rate books a specific way (e.g. a reader is into drama books). We also need to assume that there is a factor (or combination of factors) relevant to each book that results in bad or good ratings (e.g. the author is very popular).

Those underlying factors can be identified using PCA. I have dedicated a previous chapter to PCA, but for now all you need to know is that we can somehow tell which are the main factors motivating readers to give certain ratings and books to receive certain ratings. We do not, however, have the values associated with those factors. We only have the outcomes produced by combining the factors of each user with those of each product (these are the ratings!). So for every rating we know that \( \text{rating} = F_{\text{reader}} \times F_{\text{book}} \).

Each user has their own set of such equations, and all we need to do is solve for \(F_{\text{reader}}\) and \(F_{\text{book}}\). The issue is that not all the ratings are available for each user, meaning that we will not be able to find the exact numbers. There will always be an error; in other words, \( \text{rating} - (F_{\text{reader}} \times F_{\text{book}}) \) will not be zero.

To estimate the factors as well as we can, we attempt to find values of \(F_{\text{reader}}\) and \(F_{\text{book}}\) that minimise that error. (There are various techniques for this, such as stochastic gradient descent, which attempts to find a local minimum of the error through repeated small adjustments.)
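
To give a flavour of how such a minimisation might look, here is a toy sketch in R: one factor per reader and per book, made-up ratings and a hand-picked learning rate (a real system would use many more factors and add regularisation):

set.seed(1)
# Hypothetical known ratings: one row per (reader, book, rating)
known <- data.frame(reader = c(1, 1, 2, 2),
                    book   = c(1, 2, 1, 2),
                    rating = c(4, 2, 4, 2))

fReader <- c(1, 1)   # one factor per reader, initialised to 1
fBook   <- c(1, 1)   # one factor per book
learningRate <- 0.05

for (iter in 1:2000) {
  i <- sample(nrow(known), 1)                 # pick one known rating at random
  r <- known$reader[i]
  b <- known$book[i]
  error <- known$rating[i] - fReader[r] * fBook[b]
  # Nudge each factor in the direction that shrinks the squared error
  fReader[r] <- fReader[r] + learningRate * error * fBook[b]
  fBook[b]   <- fBook[b]   + learningRate * error * fReader[r]
}

# Any missing rating can now be predicted as the reader factor times the book
# factor
fReader[1] * fBook[2]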

Once we have those values we can simply use them to fill in any gap in the matrix. For example, let’s say we did not know the rating reader 1 gave for book 12, but we had enough data to calculate a factor of 2 for that reader and a factor of 2 for that book. The predicted rating would be 4, and so we should recommend book 12 to reader 1. (Of course this is a very simplified example to describe the underlying mechanics; in reality a lot more factors would be involved and we would also need to account for other issues, such as overfitting, in our equations.)

Apart from measuring our alternatives in terms of algorithms, we should also consider alternatives in terms of data. Aside from explicit ratings, customer feedback takes many forms, such as comments either on the provider’s platform or on social media. It is possible, through sentiment analysis, to capture the essence of those comments and use them to train our models and give more precise recommendations. A lot of these methods look at the presence and frequency of words and attempt to classify comments as positive or negative. For example, the naive Bayes algorithm will measure the probability of a comment being positive and the probability of it being negative, and decide on the one with the highest probability. Those probabilities are assigned by comparing the presence and frequency of certain words against the labelled (negative/positive) examples available in the training set (past data, labelled by humans, used to train the algorithm). There is a lot more to sentiment analysis and its value for recommendation systems. It can often be used to deduce additional features/attributes; in the case of books, we could classify books depending on the frequency of certain words (e.g. vampires, fairies, soldiers).

Last, we should not exclude the option of classifying the customers themselves. Although it is generally preferred to adopt a product-oriented approach, as the product is less complicated and better defined, we can always recommend products that were bought by users who share similar attributes (e.g. sex, age, interests).

7.5 Conclusion

This is by no means a complete guide to recommendation systems, but I hope you were able to understand the concepts and appreciate the potential advantages, as well as the challenges, around them.