Affinity Analysis Example

  • Sometimes we just want to make more money.

  • Data mining can certainly help!

  • “Users who liked this product also liked…”

  • Where does that information come from?

What is Affinity Analysis?

  • When we want to find similarity between two (or more) samples. Examples:

    • Users on a website for ads
    • Products to sell to those users
    • Genetics (who’s related)
    • Music (want to keep you listening)
    • Law enforcement (correlated crimes)
    • Natural Science (natural processes which occur together)

Let’s do Product Recommendations

  • It used to be the case that a person would sell you things.
  • Sales are complicated, and up-selling or selling other items (for good or ill) is a complex process.
  • How could we recover this expertise with data mining?
  • We’re going to take a simple approach. Let’s assume that if two items are often bought together, it’s likely they’ll be bought together again.

  • A really simple implementation is to look up previous transactions where product \(A\) was bought, and recommend, at random, other items from those transactions.
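That baseline can be sketched in a few lines. The transaction log and item names here are made up for illustration; `recommend_random` is a hypothetical helper, not from the book:

```python
import random

# Hypothetical transaction log: each transaction is the set of items bought.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"apples", "milk"},
    {"bread", "butter"},
]

def recommend_random(product, transactions, rng=random):
    """Pick a random other item from a random past transaction containing `product`."""
    candidates = [t for t in transactions if product in t and len(t) > 1]
    if not candidates:
        return None  # product never bought with anything else
    transaction = rng.choice(candidates)
    return rng.choice(sorted(transaction - {product}))

rec = recommend_random("bread", transactions)
```

Every call can return a different item, which is exactly why this is “not bad” rather than good: frequency information is thrown away.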

  • Not bad… but we can do better with data mining!

  • We’ll only consider two products at a time to keep complexity under control, but in principle we can consider however many we want.

  • If a person buys \(A\), how likely are they to buy \(B\)?

  • (yes it’s possible to be even more intelligent, someone who buys salad is more likely to buy ranch dressing, for example)… but that’s beyond our scope

https://github.com/dataPipelineAU/LearningDataMiningWithPython/blob/master/LearningDataMiningBook/Chapter%201/affinity_dataset.txt

This is a text file where each row is a single transaction and each column corresponds to a single item: \(0\) means the item was not purchased, \(1\) means at least one was bought.

Yes… if we were any good, we’d keep actual quantities instead of just \(0\)/\(1\) flags.

Simple ranking of rules

  • We’re interested in rules of the form “a person buys product \(A\), therefore is likely to also buy product \(B\)”.

  • We could just loop through our dataset and fetch every instance of two items being bought together… but not all of those matches make a good heuristic…

  • We’ll evaluate our rules in two ways: support and confidence.

  • Support is just the number of times the rule showed up (like a vote count). It could be normalized (divided by the total number of transactions), but we won’t do that this time.

  • Confidence is the accuracy of the rule: the number of times the rule holds divided by the number of times the premise occurs.
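In symbols, for a rule \(A \Rightarrow B\) over a set of transactions:

\[
\text{support}(A \Rightarrow B) = \#\{\text{transactions containing both } A \text{ and } B\}
\]

\[
\text{confidence}(A \Rightarrow B) = \frac{\text{support}(A \Rightarrow B)}{\#\{\text{transactions containing } A\}}
\]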

  • We’ll have to compute these for each rule over the whole database.

  • We’ll use dictionaries (for both valid and invalid rules), with a tuple as the key (premise, conclusion).

  • If a premise is present, but the conclusion isn’t, that rule goes into the invalid pile.
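The bookkeeping described above can be sketched as follows. The `X` rows here are made up, and the dictionary names (`valid_rules`, `invalid_rules`, `num_occurrences`) are illustrative choices, not required by anything:

```python
from collections import defaultdict

# Transactions as 0/1 rows (columns = items), same layout as the dataset file.
X = [
    [0, 1, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 1, 1, 0],
]
n_items = len(X[0])

valid_rules = defaultdict(int)    # (premise, conclusion) -> both bought together
invalid_rules = defaultdict(int)  # (premise, conclusion) -> premise without conclusion
num_occurrences = defaultdict(int)  # premise -> how often it was bought at all

for row in X:
    for premise in range(n_items):
        if row[premise] == 0:
            continue  # premise not bought; this rule doesn't apply to this row
        num_occurrences[premise] += 1
        for conclusion in range(n_items):
            if conclusion == premise:
                continue
            if row[conclusion] == 1:
                valid_rules[(premise, conclusion)] += 1
            else:
                invalid_rules[(premise, conclusion)] += 1

# Support is the raw count; confidence normalizes by premise occurrences.
support = dict(valid_rules)
confidence = {rule: valid_rules[rule] / num_occurrences[rule[0]]
              for rule in valid_rules}
```

With these four rows, item 1 appears in three transactions and co-occurs with item 3 in two of them, so the rule \(1 \Rightarrow 3\) has support 2 and confidence \(2/3\).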

Now to implement!