How to Understand Matrix Factorization

By Isaac

Don’t let the word “matrix” scare you off! Even if your only experience with matrices is from the movie The Matrix, you know enough to learn how they can be used to recommend movies you might like, including, if you haven’t seen it already, The Matrix. Today, we'll go over matrix factorization by taking a look at Netflix!

[Image: matrix.png]

A Matrix is like an Excel Spreadsheet

A matrix is just an array of numbers, like in The Matrix background above, and, more mundanely, many Excel spreadsheets. There’s a lot more to a matrix than meets the eye, though. Matrices have a fascinating algebraic structure: there’s a way to take two spreadsheet tables and multiply them together to get a third. The calculation is messy and the meaning opaque, even if you’ve taken a linear algebra course. So to make this clearer, we’re going to look at matrices from a different perspective.

Somewhere at Netflix headquarters is a massive spreadsheet. There’s a row for every Netflix user, and every column corresponds to a movie or a show. There’s a value there that might indicate if you’ve seen that show, if you liked it, if you stopped halfway through, etc. It’s a massive collection of numbers, considering the tens of millions of users and thousands of shows. It’s also quite sparse, which means there are lots of zeros, seeing as most people have only seen a small slice of the total content available.

A Matrix is secretly a function

To make a little more sense of this array of numbers, we’re going to think of it in a new way. Let’s consider this matrix to be a function, from the space of people to the space of movie ratings. You input a user into the function and the output is their rating history. However, because the matrix is huge, so is this function. It’s very hard to make sense of all this data, to find patterns and make useful recommendations. That’s where matrix multiplication comes in.
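
To see the "matrix as a function" idea in miniature, here is a small sketch in plain Python. The show titles, the ratings, and the helper name apply_matrix are all made up for illustration; the only point is that feeding in a vector that selects one user returns that user's row of ratings.

```python
# Hypothetical toy data: rows are users, columns are shows; 0 means "not watched".
shows = ["The Matrix", "Alien", "Amelie"]
ratings = [
    [5, 4, 0],  # user 0
    [0, 0, 5],  # user 1
]

def apply_matrix(matrix, vector):
    """Treat `matrix` as a function: weight each row by `vector` and add them up."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [sum(vector[i] * matrix[i][j] for i in range(n_rows)) for j in range(n_cols)]

# A vector with a 1 in position 0 and 0 elsewhere stands for "user 0".
user_0 = [1, 0]
history = apply_matrix(ratings, user_0)
print(dict(zip(shows, history)))
```

The input lives in the "space of people" (one slot per user) and the output lives in the "space of ratings" (one slot per show), which is exactly the function described above.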

So what is this multiplication all about? Let’s suppose you are a nutritionist. On your PC are two spreadsheets. One documents the foods your clients eat every day. The other documents the caloric value of each food. The first is a function from the space of people to the space of foods, and the second is a function from the space of foods to the space of calories. You can compose these functions, applying one after the other. This has the effect of assigning to every person their daily caloric value, by summing up the calories of each food they eat. This new matrix, whose rows are people and whose single column is daily caloric intake, is precisely the product of the two original tables. This is the intuition for what matrix multiplication means.
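
The nutritionist example can be worked through by hand or in a few lines of Python. The sketch below uses invented serving counts and round-number calorie values (they are not real nutritional data), and a small matmul helper written out so you can see the sum being taken.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows: (n x k) times (k x m)."""
    assert len(a[0]) == len(b), "inner dimensions must match"
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# People -> foods. Rows: clients. Columns: daily servings of apple, bread, cheese.
servings = [
    [2, 1, 0],  # client A
    [0, 3, 1],  # client B
]

# Foods -> calories. Rows: apple, bread, cheese. One column: calories per serving.
calories = [
    [100],
    [80],
    [110],
]

# People -> calories: the composition of the two functions.
daily = matmul(servings, calories)
print(daily)  # [[280], [350]]
```

Client A's total is 2 x 100 + 1 x 80 + 0 x 110 = 280, which is the "summing up the calories of each food" step from the paragraph above.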

Then how do I use matrix algebra to generate movie recommendations?

So now we return to our movie database. We have a huge matrix (or a function) that is simply too large to make sense of. But we can guess that the task of recommending movies isn’t really that complex, that it really doesn’t depend on thousands of variables. Why? Because genres exist! I like science fiction, and The Matrix is a highly-rated science fiction movie. Recommendation made. But actually, it’s not so simple. How does Netflix know what movies I like? How do they choose the right genres, and decide to what numerical extent The Matrix is a science fiction film and to what extent it is a psychological thriller or a horror…?

What we are, in a sense, asking for, is two more functions. One function from the space of people to the space of genres, and another from the space of genres to the space of movies or shows. These two functions would just be matrices, recording to what extent I like various genres, and to what extent each genre is represented in a given film or show. The product of these matrices should agree with the matrix of user preferences that Netflix already has, at least for the nonzero entries that correspond to the shows actually having been watched. So instead of multiplying two matrices to get a third, I want to start with my matrix and find two new matrices which multiply together to the original (or close to it). In other words, I want to do matrix factorization.
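
Here is a toy version of what such a pair of matrices might look like. Everything in it is an assumption made for illustration: two genres, three shows, hand-picked numbers, and the rule "suggest anything unwatched with a predicted score of 4 or more." It is not how Netflix actually scores anything.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

shows = ["The Matrix", "Alien", "Amelie"]

# People -> genres: how much each user likes (science fiction, romance).
user_genre = [
    [1, 0],    # user 0: all science fiction
    [0.5, 1],  # user 1: mostly romance
]

# Genres -> shows: how strongly each genre shows up in each title.
genre_show = [
    [4, 4, 0],  # science fiction
    [0, 2, 4],  # romance
]

# The ratings we already have; 0 means "not watched".
observed = [
    [4, 0, 0],
    [2, 4, 4],
]

predicted = matmul(user_genre, genre_show)

# The product agrees with every rating that actually exists...
for i, row in enumerate(observed):
    for j, rating in enumerate(row):
        if rating != 0:
            assert predicted[i][j] == rating

# ...and it also fills in the blanks, which is where suggestions come from.
suggestions = [
    shows[j] for j in range(len(shows))
    if observed[0][j] == 0 and predicted[0][j] >= 4
]
print(suggestions)  # ['Alien']
```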

At this point, some more sophisticated mathematics takes over. Algorithms have been designed to perform this matrix factorization, and in the real world we can only expect the product of these factors to be close to our original matrix. After all, human preferences are complex; they can be well-approximated by a list of simple genres, but not entirely explained by them.
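
To give a flavor of what such an algorithm can look like, here is a minimal sketch of one simple approach: start with random factors and repeatedly nudge them to shrink the error on the entries that have actually been rated. This is only an illustration under assumed settings (the toy ratings, the learning rate lr, the number of passes, and the function names are all made up here); it is not a description of what Netflix runs.

```python
import random

def factorize(ratings, n_genres, epochs=2000, lr=0.01, seed=0):
    """Fit user_genre x genre_show to the nonzero entries of `ratings`."""
    rng = random.Random(seed)
    n_users, n_shows = len(ratings), len(ratings[0])
    user_genre = [[rng.uniform(0.1, 1.0) for _ in range(n_genres)] for _ in range(n_users)]
    genre_show = [[rng.uniform(0.1, 1.0) for _ in range(n_shows)] for _ in range(n_genres)]
    watched = [(i, j) for i in range(n_users) for j in range(n_shows) if ratings[i][j] != 0]
    for _ in range(epochs):
        for i, j in watched:
            guess = sum(user_genre[i][g] * genre_show[g][j] for g in range(n_genres))
            error = ratings[i][j] - guess
            # Move both factors a small step in the direction that reduces the error.
            for g in range(n_genres):
                u, v = user_genre[i][g], genre_show[g][j]
                user_genre[i][g] += lr * error * v
                genre_show[g][j] += lr * error * u
    return user_genre, genre_show

def watched_error(ratings, user_genre, genre_show):
    """Root-mean-square gap between the product and the nonzero ratings."""
    total, count = 0.0, 0
    for i, row in enumerate(ratings):
        for j, rating in enumerate(row):
            if rating != 0:
                guess = sum(user_genre[i][g] * genre_show[g][j] for g in range(len(genre_show)))
                total += (rating - guess) ** 2
                count += 1
    return (total / count) ** 0.5

# Hypothetical ratings on a 1-5 scale; 0 means "not watched".
ratings = [
    [5, 4, 0, 1],
    [4, 0, 1, 1],
    [1, 1, 5, 0],
    [0, 1, 4, 5],
]

# epochs=0 returns the random starting point, for comparison.
before = watched_error(ratings, *factorize(ratings, n_genres=2, epochs=0))
after = watched_error(ratings, *factorize(ratings, n_genres=2))
print(f"error before fitting: {before:.2f}, after fitting: {after:.2f}")
```

Notice that the zeros are skipped during fitting. They mean "no information yet," not "rated zero," so the factors are only asked to match the ratings that exist.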

How many genres are there? Well…it’s up to you.

There is still one more matter of concern here. The algorithms mentioned above need to be given a number – how many genres you want there to be. This is a parameter that can be tuned by the people working at Netflix. On the one hand, you could try to have a single genre, and really reduce the complexity of the problem to almost nothing, but the factorization will approximate the real data very poorly, since the real world has more than one kind of cinematic genre. On the other hand, you could try having millions of genres. You’d be guaranteed to approximate the real data very well, but you wouldn’t have simplified the problem at all. After all, you could make a genre for every person, i.e. “the genre of movies Robert Smith likes,” “the genre of movies Laura Yin likes,” etc. With this strategy the addition of genres adds nothing.
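
Both extremes can be checked with a little arithmetic. The sketch below first builds the "one genre per person" factorization for a tiny made-up table and confirms it reproduces the data exactly, then counts how many numbers each option would need to store. The user and show counts are hypothetical round figures chosen to match "tens of millions of users and thousands of shows," and 20 genres is just an example value.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# Hypothetical ratings; 0 means "not watched".
ratings = [
    [5, 4, 0],
    [0, 0, 5],
    [4, 0, 1],
]
n_users = len(ratings)

# One "genre" per user: people -> genres is the identity matrix,
# and genres -> shows is simply a copy of the original table.
user_genre = [[1 if i == g else 0 for g in range(n_users)] for i in range(n_users)]
genre_show = [row[:] for row in ratings]
exact = matmul(user_genre, genre_show) == ratings
print(exact)  # True: a perfect fit that explains nothing

def factor_size(n_users, n_shows, n_genres):
    """How many numbers the two factor matrices hold in total."""
    return n_genres * (n_users + n_shows)

# Assumed round numbers, for scale only.
users, titles = 10_000_000, 5_000
full_table = users * titles
with_20_genres = factor_size(users, titles, 20)
one_genre_per_user = factor_size(users, titles, users)
print(full_table, with_20_genres, one_genre_per_user)
```

With these assumed sizes, 20 genres need about 200 million numbers instead of 50 billion, while one genre per user needs more numbers than the table you started with.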

The key here is to fiddle around and find some small (but not tiny) number of genres, which is large enough to help explain real people’s preferences, but small enough to reduce the complexity of the dataset to the point that patterns emerge, and recommendations become possible. At this sweet spot, the genres will correspond (closely enough) to real genres like horror, comedy, action, and maybe a few niche but important special interests.

The Concluding Scene

So that, in summary, is how (approximately) factoring a large matrix (or a numerical spreadsheet, which is the same thing) into two pieces of just the right size can be used to find a small collection of variables that give insight into the patterns of your data. I guess if you’ve made it all the way to the end of this short expository article, you ought to be commended for taking the red pill. Who would have thought the rabbit-hole ended in math class?

