Evaluating Recommender Systems

By: Nunung Nurul Qomariyah, Ph.D

Evaluating Recommender Systems can be difficult because each algorithm has its own particular focus. Some work better with large datasets, whilst others work better with smaller ones. Furthermore, the different types of dataset used by some RS algorithms make them difficult to compare with others, and some work well in a specific domain but are unsuitable for other domains. Choosing an appropriate method to evaluate an RS is therefore an important issue: it ensures that a system can be deployed in the market with confidence, or that a novel system built for academic purposes can be assessed soundly.

The evaluation of an RS can be conducted using several methods, either as an offline test or an online test, and either can be combined with a user study to measure how satisfied users are with the system. An offline test is inexpensive because it does not require user involvement: it looks only at users’ historical data, and it is usually performed to make sure that all algorithms and environments work well before an online test is run. An online test can be more informative because it can discover the real tastes of the users. Some important properties to consider when deciding on the best recommender algorithm include the following:
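
As a concrete illustration, the sketch below shows the basic shape of an offline test in Python, assuming a toy list of historical (user, item, rating) tuples: the data are split into a training set and a held-out test set, a trivial user-mean predictor stands in for the recommender, and its prediction error (RMSE) is measured on the held-out ratings. All names and data here are hypothetical.

```python
import random
from collections import defaultdict
from math import sqrt

# Hypothetical historical ratings: (user_id, item_id, rating) tuples.
ratings = [("u1", "i1", 4.0), ("u1", "i2", 3.0), ("u2", "i1", 5.0),
           ("u2", "i3", 2.0), ("u3", "i2", 4.0), ("u3", "i3", 3.0)]

# Offline test: hold out a fraction of the historical data as a test set.
random.seed(0)
random.shuffle(ratings)
split = int(0.8 * len(ratings))
train, test = ratings[:split], ratings[split:]

# A trivial recommender: predict each user's mean training rating.
sums, counts = defaultdict(float), defaultdict(int)
for user, _, rating in train:
    sums[user] += rating
    counts[user] += 1
global_mean = sum(r for _, _, r in train) / len(train)

def predict(user, item):
    # Fall back to the global mean for users unseen in training (cold-start).
    return sums[user] / counts[user] if counts[user] else global_mean

# Measure prediction error (RMSE) on the held-out ratings.
rmse = sqrt(sum((predict(u, i) - r) ** 2 for u, i, r in test) / len(test))
print(f"RMSE on held-out ratings: {rmse:.3f}")
```

In practice the same harness is reused unchanged while the trivial predictor is swapped for each candidate algorithm, which is what makes offline tests cheap to run on historical data alone.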

  1. Prediction accuracy and coverage: This is the most widely discussed property in the literature. It follows the basic assumption in Recommender Systems that users prefer more accurate predictions that cover a larger portion of the item catalogue (see the metric sketch after this list).
  2. Cold-start: In many cases, when a new item or a new user is added to the system, the recommender algorithm has difficulty making recommendations because it lacks sufficient information about them. This is called the cold-start problem.
  3. Novelty and serendipity: Novelty measures the extent to which the RS recommends items that are new or unknown to the user. Serendipity measures the extent to which the RS provides surprising yet beneficial items to the users.
  4. Diversity: In some types of application, users may prefer diverse recommendations over items that are too similar to one another (a common way to quantify this is sketched after this list).
  5. Utility: A utility score quantifies the value that the system’s recommendations deliver to the users.
  6. Trust, risk and privacy: This property refers to the risk a user takes when accepting a recommendation. For example, stock-purchase recommendations expose users to a much higher risk than movie recommendations. Users also need to feel secure when using the system, which covers both privacy and trust.
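
To make the first property (accuracy and coverage) concrete, the short sketch below computes two standard offline metrics on invented data: mean absolute error (MAE) over predicted ratings, and catalogue coverage, i.e. the fraction of all items that ever appear in a user’s top-N recommendation list.

```python
# Hypothetical (predicted, actual) rating pairs from an offline test.
pairs = [(3.8, 4.0), (2.9, 3.0), (4.4, 5.0), (2.1, 2.0)]

# Mean absolute error: the average size of the prediction mistakes.
mae = sum(abs(p - a) for p, a in pairs) / len(pairs)

# Catalogue coverage: fraction of the full catalogue that the system
# actually recommends across all users' top-N lists.
catalogue = {"i1", "i2", "i3", "i4", "i5"}
top_n_lists = {"u1": ["i1", "i2"], "u2": ["i2", "i3"], "u3": ["i1", "i3"]}
recommended = {item for items in top_n_lists.values() for item in items}
coverage = len(recommended) / len(catalogue)

print(f"MAE: {mae:.3f}, coverage: {coverage:.0%}")
```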
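
Diversity and novelty can likewise be estimated offline. The sketch below uses two common proxies, again on invented data: intra-list diversity, the average pairwise dissimilarity (here, one minus the Jaccard similarity of item tags) within a single recommendation list, and novelty, the mean self-information -log2(popularity) of the recommended items, so that a list containing only blockbusters scores low.

```python
from itertools import combinations
from math import log2

# Hypothetical item tags and popularity (fraction of users who rated each item).
tags = {"i1": {"action", "sci-fi"}, "i2": {"action"}, "i3": {"drama", "romance"}}
popularity = {"i1": 0.9, "i2": 0.5, "i3": 0.05}

recommendation = ["i1", "i2", "i3"]  # one user's top-N list

def jaccard(a, b):
    return len(tags[a] & tags[b]) / len(tags[a] | tags[b])

# Intra-list diversity: average pairwise dissimilarity within the list.
item_pairs = list(combinations(recommendation, 2))
diversity = sum(1 - jaccard(a, b) for a, b in item_pairs) / len(item_pairs)

# Novelty: mean self-information of the items; rarer items score higher.
novelty = sum(-log2(popularity[i]) for i in recommendation) / len(recommendation)

print(f"intra-list diversity: {diversity:.2f}, novelty: {novelty:.2f} bits")
```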


Sources:

Qomariyah, N. N. (2018). Pairwise Preferences Learning for Recommender Systems (Doctoral dissertation, University of York).