Sometimes we just want to make more money.
Data mining can certainly help!
“Users who liked this product also liked…”
Where does that information come from?
This is affinity analysis: measuring similarity between two (or more) samples, for example products that tend to be bought together.
We’re going to take a simple approach. Let’s assume that if two items are often bought together, it’s likely they’ll be bought together again.
A really simple implementation is to look up previous transactions where product \(A\) was bought, and recommend, at random, other items that appeared in those transactions.
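A minimal sketch of that random baseline (the helper name and data layout are my own, assuming each transaction is a set of item ids):

```python
import random

# Hypothetical helper: recommend one item at random from past
# transactions that also contained item `a`.
def recommend_random(transactions, a):
    # gather every co-purchased item across transactions containing `a`
    candidates = [item
                  for t in transactions
                  if a in t
                  for item in t
                  if item != a]
    return random.choice(candidates) if candidates else None

# toy usage: three past transactions
history = [{0, 1}, {0, 2}, {1, 2}]
print(recommend_random(history, 0))  # prints 1 or 2
```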
Not bad… but we can do better with data mining!
We’ll only consider two products at a time to keep complexity under control, but in principle we can consider however many we want.
If a person buys \(A\), how likely are they to buy \(B\)?
(yes, it’s possible to be even more intelligent: someone who buys salad is more likely to buy ranch dressing, for example)… but that’s beyond our scope
https://github.com/dataPipelineAU/LearningDataMiningWithPython/blob/master/LearningDataMiningBook/Chapter%201/affinity_dataset.txt
This is a txt file where each row is a single transaction and each column corresponds to a single product: \(0\) means the product wasn’t bought, \(1\) means at least one was bought.
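Loading that format is a one-liner with NumPy. A sketch using a few inlined sample rows (in practice you’d pass the downloaded filename instead):

```python
import numpy as np
from io import StringIO

# A few sample rows in the same format as affinity_dataset.txt:
# one row per transaction, one column per product, 0/1 flags.
sample = StringIO("0 1 0 1 1\n1 1 0 0 0\n0 0 1 1 1\n")

# With the real file: X = np.loadtxt("affinity_dataset.txt")
X = np.loadtxt(sample)
n_transactions, n_products = X.shape
print(n_transactions, n_products)  # 3 transactions, 5 products
```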
Yes… if we were any good, we’d record actual quantities rather than a yes/no flag.
We’re interested in rules of the form “a person buys product \(A\), therefore they’re likely to buy product \(B\)”.
We could just loop through our dataset and fetch every instance of two items being bought together… but not all of those matches make a good heuristic…
We’ll evaluate our rules in two ways: support and confidence.
Support is just the number of times the rule showed up (like a vote count). It could be normalized (divided by how often the premise was true), but we won’t do that this time.
Confidence is the accuracy of the rule: the number of times the rule holds divided by the number of times the premise occurs.
We’ll have to compute these for each rule over the whole database.
We’ll use dictionaries (one for valid rules, one for invalid rules), with a (premise, conclusion) tuple as the key.
If a premise is present, but the conclusion isn’t, that rule goes into the invalid pile.
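The counting loop described above can be sketched like this (variable names are my own, run on a toy 0/1 matrix rather than the real dataset):

```python
from collections import defaultdict
import numpy as np

# Toy transaction matrix: rows are transactions, columns are products.
X = np.array([[0, 1, 1],
              [1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])

valid_rules = defaultdict(int)    # premise bought AND conclusion bought
invalid_rules = defaultdict(int)  # premise bought, conclusion not bought
occurrences = defaultdict(int)    # how often each premise was bought

n_products = X.shape[1]
for row in X:
    for premise in range(n_products):
        if row[premise] == 0:
            continue
        occurrences[premise] += 1
        for conclusion in range(n_products):
            if conclusion == premise:
                continue
            if row[conclusion] == 1:
                valid_rules[(premise, conclusion)] += 1
            else:
                invalid_rules[(premise, conclusion)] += 1

# Support is the raw count; confidence divides by premise occurrences.
support = dict(valid_rules)
confidence = {rule: count / occurrences[rule[0]]
              for rule, count in valid_rules.items()}
```

On this toy data, product 1 appears in all four transactions and products 1 and 2 co-occur in three of them, so the rule (1, 2) has support 3 and confidence 0.75.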