Sometimes we just want to make more money.
Data mining can certainly help!
“Users who liked this product also liked…”
Where does that information come from?
This is affinity analysis: measuring similarity between two (or more) samples, for example products that tend to be bought together.
We’re going to take a simple approach. Let’s assume that if two items are often bought together, it’s likely they’ll be bought together again.
A really simple implementation is to look up previous transactions where product \(A\) was bought, and recommend, at random, other items that appeared in those transactions.
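A minimal sketch of that random baseline (the helper name and data layout are my own, assuming each transaction is a set of item ids):

```python
import random

# Hypothetical helper: recommend one item at random from past
# transactions that also contained item `a`.
def recommend_random(transactions, a):
    # gather every co-purchased item across transactions containing `a`
    candidates = [item
                  for t in transactions
                  if a in t
                  for item in t
                  if item != a]
    return random.choice(candidates) if candidates else None

# toy usage: three past transactions
history = [{0, 1}, {0, 2}, {1, 2}]
print(recommend_random(history, 0))  # prints 1 or 2
```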
Not bad… but we can do better with data mining!
We’ll only consider two products at a time to keep complexity under control, but in principle we can consider however many we want.
If a person buys \(A\), how likely are they to buy \(B\)?
(yes, it’s possible to be even more intelligent: someone who buys salad is more likely to buy ranch dressing, for example)… but that’s beyond our scope
https://github.com/dataPipelineAU/LearningDataMiningWithPython/blob/master/LearningDataMiningBook/Chapter%201/affinity_dataset.txt
This is a txt file where each row is a single transaction and each column corresponds to a single product: \(0\) means the product wasn’t bought, \(1\) means at least one was bought.
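Loading that format is a one-liner with NumPy. A sketch using a few inlined sample rows (in practice you’d pass the downloaded filename instead):

```python
import numpy as np
from io import StringIO

# A few sample rows in the same format as affinity_dataset.txt:
# one row per transaction, one column per product, 0/1 flags.
sample = StringIO("0 1 0 1 1\n1 1 0 0 0\n0 0 1 1 1\n")

# With the real file: X = np.loadtxt("affinity_dataset.txt")
X = np.loadtxt(sample)
n_transactions, n_products = X.shape
print(n_transactions, n_products)  # 3 transactions, 5 products
```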
Yes… if we were any good, we’d record actual quantities rather than a yes/no flag.
We’re interested in rules of the form “a person buys product \(A\), therefore they’re likely to buy product \(B\)”.
We could just loop through our dataset and fetch every instance of two items being bought together… but not all of those matches make a good heuristic…
We’ll evaluate our rules in two ways: support and confidence.
Support is just the number of times the rule showed up (like a vote count). It could be normalized (divided by how often the premise was true), but we won’t do that this time.
Confidence is the accuracy of the rule: the number of times the rule holds divided by the number of times the premise occurs.
We’ll have to compute these for each rule over the whole database.
We’ll use dictionaries (one for valid rules, one for invalid rules), with a (premise, conclusion) tuple as the key.
If a premise is present, but the conclusion isn’t, that rule goes into the invalid pile.
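The counting loop described above can be sketched like this (variable names are my own, run on a toy 0/1 matrix rather than the real dataset):

```python
from collections import defaultdict
import numpy as np

# Toy transaction matrix: rows are transactions, columns are products.
X = np.array([[0, 1, 1],
              [1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])

valid_rules = defaultdict(int)    # premise bought AND conclusion bought
invalid_rules = defaultdict(int)  # premise bought, conclusion not bought
occurrences = defaultdict(int)    # how often each premise was bought

n_products = X.shape[1]
for row in X:
    for premise in range(n_products):
        if row[premise] == 0:
            continue
        occurrences[premise] += 1
        for conclusion in range(n_products):
            if conclusion == premise:
                continue
            if row[conclusion] == 1:
                valid_rules[(premise, conclusion)] += 1
            else:
                invalid_rules[(premise, conclusion)] += 1

# Support is the raw count; confidence divides by premise occurrences.
support = dict(valid_rules)
confidence = {rule: count / occurrences[rule[0]]
              for rule, count in valid_rules.items()}
```

On this toy data, product 1 appears in all four transactions and products 1 and 2 co-occur in three of them, so the rule (1, 2) has support 3 and confidence 0.75.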