This work is licensed under the Creative Commons Attribution 4.0 International License. For questions please contact Michael Hahsler.

Calculate Content-based Item Similarity

We use the Jester data set in recommenderlab.

library("recommenderlab")
## Loading required package: Matrix
## Loading required package: registry
## Loading required package: arules
## 
## Attaching package: 'arules'
## 
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
## 
## Loading required package: proxy
## 
## Attaching package: 'proxy'
## 
## The following object is masked from 'package:Matrix':
## 
##     as.matrix
## 
## The following objects are masked from 'package:stats':
## 
##     as.dist, dist
## 
## The following object is masked from 'package:base':
## 
##     as.matrix
data("Jester5k")

The data set contains 100 jokes.

length(JesterJokes)
## [1] 100
cat(JesterJokes[1])
## A man visits the doctor. The doctor says "I have bad news for you.You have cancer and Alzheimer's disease". The man replies "Well,thank God I don't have cancer!"

Find similar jokes using text mining. Create a corpus first.

library("tm")
## Loading required package: NLP
## 
## Attaching package: 'tm'
## 
## The following object is masked from 'package:arules':
## 
##     inspect
library("SnowballC")
joke_source <- VectorSource(JesterJokes)  # avoid masking base::source()
corp <- VCorpus(joke_source)
corp
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 100
content(corp[[1]])
## [1] "A man visits the doctor. The doctor says \"I have bad news for you.You have cancer and Alzheimer's disease\". The man replies \"Well,thank God I don't have cancer!\""

Preprocess the data.

corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, stemDocument)
corp <- tm_map(corp, stripWhitespace)

content(corp[[1]])
## [1] " man visit doctor doctor say bad news cancer alzheim diseas man repli wellthank god cancer"
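The token "wellthank" in the output above comes from the source text "Well,thank", which has no space after the comma. A minimal base R sketch of the effect, using a made-up variable name and no tm functions: deleting punctuation glues the two words together, while replacing punctuation with a space keeps them apart. This is only an illustration, not a change to the pipeline above.

```r
# Toy string taken from joke 1; base R only, no tm required.
x <- tolower("Well,thank God I don't have cancer!")

# Deleting punctuation outright glues neighboring words together.
deleted <- gsub("[[:punct:]]+", "", x)

# Replacing punctuation with a space keeps the words apart;
# the extra whitespace is squeezed and trimmed afterwards.
spaced <- gsub("[[:punct:]]+", " ", x)
spaced <- trimws(gsub("[[:space:]]+", " ", spaced))

print(deleted)  # "wellthank god i dont have cancer"
print(spaced)   # "well thank god i don t have cancer"
```

A function like the second variant could presumably be wrapped in content_transformer() and passed to tm_map() in place of removePunctuation. The trade-off is that contractions such as "don't" are split into two tokens.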

Create a document-term matrix.

dtm <- DocumentTermMatrix(corp)
dtm
## <<DocumentTermMatrix (documents: 100, terms: 1163)>>
## Non-/sparse entries: 2157/114143
## Sparsity           : 98%
## Maximal term length: 15
## Weighting          : term frequency (tf)
inspect(dtm[1:10, 1:10])
## <<DocumentTermMatrix (documents: 10, terms: 10)>>
## Non-/sparse entries: 1/99
## Sparsity           : 99%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs abort accompani account acquit actual add address adopt afraid africa
##   1      0         0       0      0      0   0       0     0      0      0
##   2      0         0       0      0      0   0       0     0      0      0
##   3      0         0       0      0      0   0       0     0      0      0
##   4      0         0       0      0      0   0       0     0      0      0
##   5      0         0       0      0      0   0       1     0      0      0
##   6      0         0       0      0      0   0       0     0      0      0
##   7      0         0       0      0      0   0       0     0      0      0
##   8      0         0       0      0      0   0       0     0      0      0
##   9      0         0       0      0      0   0       0     0      0      0
##   10     0         0       0      0      0   0       0     0      0      0
freq <- colSums(as.matrix(dtm))
head(sort(freq, decreasing = TRUE))
##   say   one   man engin   ask repli 
##    60    40    36    27    26    23

Calculate term frequency-inverse document frequency (tf-idf) scores.

tfidf <- weightTfIdf(dtm, normalize = TRUE)
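To make the weighting concrete, here is a small base R sketch on a made-up 3 x 3 count matrix. It assumes, following our reading of the tm documentation, that the normalized term frequency is the count divided by the document length and that the inverse document frequency uses a base-2 logarithm. The matrix and all names in it are hypothetical.

```r
# Toy document-term counts: 3 documents, 3 terms (made-up numbers).
counts <- matrix(c(2, 1, 0,
                   0, 1, 1,
                   0, 1, 3),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Docs = c("d1", "d2", "d3"),
                                 Terms = c("doctor", "man", "engin")))

# Term frequency, normalized by document length (the normalize = TRUE idea).
tf <- counts / rowSums(counts)

# Inverse document frequency:
# log2(number of documents / number of documents containing the term).
idf <- log2(nrow(counts) / colSums(counts > 0))

# Scale every column of tf by the idf of its term.
toy_tfidf <- sweep(tf, 2, idf, "*")
print(round(toy_tfidf, 3))
```

The term "man" occurs in every toy document, so its idf is log2(3/3) = 0 and its whole column is zero. This is how tf-idf downweights words that do not help to tell documents apart.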

1. Find Groups Using Clustering

Cluster the jokes using cosine distance between their tf-idf vectors and hierarchical clustering (complete linkage, the default in hclust).

library("proxy")
d <- dist(as.matrix(tfidf), method = "cosine")
hc <- hclust(d)
plot(hc)
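The cosine distance used above is one minus the cosine of the angle between two tf-idf vectors. The following base R sketch computes it by hand for a made-up matrix; the helper name cosine_dist is hypothetical and is not part of proxy.

```r
# Cosine distance between the rows of a numeric matrix: 1 - cosine similarity.
# Assumes no row is all zeros (a zero row would produce NaN).
cosine_dist <- function(m) {
  norms <- sqrt(rowSums(m^2))
  sim <- (m %*% t(m)) / (norms %o% norms)
  stats::as.dist(1 - sim)
}

toy <- rbind(a = c(1, 0, 1),
             b = c(2, 0, 2),   # same direction as a
             c = c(0, 3, 0))   # shares no terms with a or b

toy_cos <- cosine_dist(toy)
print(round(as.matrix(toy_cos), 3))
```

Rows a and b point in the same direction and get distance 0 even though their lengths differ, while c shares no nonzero component with them and gets distance 1. Document length therefore does not affect the comparison.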

grouping <- cutree(hc, h = .9)
sort(table(grouping), decreasing = TRUE)
## grouping
## 25  7  1 14 31 34 35  4  6  8 10 12 15 17 27 28 29 30 33 38 40 42 48  2  3 
##  5  4  3  3  3  3  3  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  1 
##  5  9 11 13 16 18 19 20 21 22 23 24 26 32 36 37 39 41 43 44 45 46 47 49 50 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 
## 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 
##  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
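Cutting a complete-linkage tree at height h yields groups in which every pair of members is at most h apart. A small sketch with made-up one-dimensional points (Euclidean distance instead of cosine, purely for readability) shows the same cutree call on data that can be checked by eye:

```r
# Six made-up points on a line: two tight groups and one outlier.
pts <- c(0, 0.1, 0.2, 5, 5.1, 20)

# Complete linkage is the default method of hclust().
toy_hc <- hclust(stats::dist(pts))

# Cut the tree at height 1: clusters merged above that height stay separate.
toy_groups <- cutree(toy_hc, h = 1)
print(table(toy_groups))

# Largest within-group distance; it cannot exceed the cut height.
within_max <- max(sapply(split(pts, toy_groups), function(g) diff(range(g))))
print(within_max)
```

As in the joke data, an item that is far from everything else ends up in a group of its own.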

Look at all the jokes in one group. These jokes will be similar.

cat(JesterJokes[grouping==25], sep = "\n\n")
## A mechanical, electrical and a software engineer from Microsoft were driving through the desert when the car broke down. The mechanical engineer said "It seems to be a problem with the fuel injection system, why don't we pop the hood and I'll take a look at it." To which the electrical engineer replied, "No I think it's just a loose ground wire, I'll get out and take a look." Then, the Microsoft engineer jumps in. "No, no, no. If we just close up all the windows, get out, wait a few minutes, get back in, and then reopen the windows everything will work fine."
## 
## There was an engineer who had an exceptional gift for fixing all things mechanical. After serving his company loyally for over 30 years, he happily retired. Several years later the company contacted him regarding a seemingly impossible problem they were having with one of their multi-million dollar machines. They had tried everything and everyone else to get the machine fixed, but to no avail. In desperation, they called on the retired engineer who had solved so many of their problems in the past. The engineer reluctantly took the challenge. He spent a day studying the huge machine. At the end of the day, he marked a small "x" in chalk on a particular component of the machine and proudly stated, "This is where your problem is". The part was replaced and the machine worked perfectly again. The company received a bill for $50,000 from the engineer for his service.They demanded an itemized accounting of his charges. The engineer responded briefly: One chalk mark $1 Knowing where to put it $49,999 It was paid in full and the engineer retired again in peace.
## 
## Three engineering students were gathered together discussing the possible designers of the human body. One said, "It was a mechanical engineer. Just look at all the joints." Another said, "No, it was an electrical engineer. The nervous systems many thousands of electrical connections." The last said, "Actually it was a civil engineer. Who else would run a toxic waste pipeline through a recreational area?"
## 
## Q: What is the difference between Mechanical Engineers and Civil Engineers? A: Mechanical Engineers build weapons, Civil Engineers build targets.
## 
## Reaching the end of a job interview, the human resources person asked a young engineer fresh out of Stanford, "And what starting salary were you looking for?" The engineer said, "In the neighborhood of $125,000 a year, depending on the benefits package." The interviewer said, "Well, what would you say to a package of 5-weeks vacation, 14 paid holidays, full medical and dental, company matching retirement fund to 50% of salary, and a company car leased every 2 years - say, a red Corvette?" The Engineer sat up straight and said, "Wow! Are you kidding?" And the interviewer replied, "Yeah, but you started it."
cat(JesterJokes[grouping==7], sep = "\n\n")
## How many feminists does it take to screw in a light bulb? That's not funny.
## 
## How many men does it take to screw in a light bulb? One...men will screw anything.
## 
## Q: How many stalkers does it take to change a light bulb? A: Two. One to replace the bulb, and the other to watch it day and night.
## 
## Q: How many Presidents does it take to screw in a light bulb? A: It depends upon your definition of screwing a light bulb.

If a user likes a joke in a group, then the recommender system will recommend similar jokes in the same group.
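A minimal sketch of this recommendation rule, assuming a grouping vector like the one returned by cutree above. The function name recommend_from_group and the toy grouping are hypothetical.

```r
# Return the indices of all other items in the same group as the liked item.
recommend_from_group <- function(liked, grouping) {
  same_group <- which(grouping == grouping[liked])
  setdiff(same_group, liked)
}

# Toy grouping of six items into three groups (made-up labels).
toy_grouping <- c(1, 2, 1, 3, 2, 1)

rec_1 <- recommend_from_group(1, toy_grouping)  # items 3 and 6
rec_4 <- recommend_from_group(4, toy_grouping)  # item 4 is alone in its group
print(rec_1)
print(rec_4)
```

With the jokes, the call would look like JesterJokes[recommend_from_group(i, grouping)] for a liked joke i. Many groups in the table above contain a single joke, so this approach returns nothing for those jokes.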

2. Use k-Nearest Neighbors

kNN always recommends the k most similar jokes, no matter how similar they actually are.

library("dbscan")
nn <- kNN(d, 5)

Nearest neighbors for joke 1:

cat(JesterJokes[1])
## A man visits the doctor. The doctor says "I have bad news for you.You have cancer and Alzheimer's disease". The man replies "Well,thank God I don't have cancer!"
cat(JesterJokes[nn$id[1,]], sep = "\n\n")
## A man, recently completing a routine physical examination receives a phone call from his doctor. The doctor says, "I have some good news and some bad news." The man says, "OK, give me the good news first." The doctor says, "The good news is, you have 24 hours to live." The man replies, "Shit! That's the good news? Then what's the bad news?" The doctor says, "The bad news is, I forgot to call you yesterday."
## 
## A Czechoslovakian man felt his eyesight was growing steadily worse, and felt it was time to go see an optometrist. The doctor started with some simple testing, and showed him a standard eye chart with letters of diminishing size: CRKBNWXSKZY. . . "Can you read this?" the doctor asked. "Read it?" the Czech answered. "Doc, I know him!"
## 
## A man piloting a hot air balloon discovers he has wandered off course and is hopelessly lost. He descends to a lower altitude and locates a man down on the ground. He lowers the balloon further and shouts "Excuse me, can you tell me where I am?" The man below says: "Yes, you're in a hot air balloon, about 30 feet above this field." "You must work in Information Technology," says the balloonist. "Yes I do," replies the man. "And how did you know that?" "Well," says the balloonist, "what you told me is technically correct, but of no use to anyone." The man below says, "You must work in management." "I do," replies the balloonist, "how did you know?" "Well," says the man, "you don't know where you are, or where you're going, but you expect my immediate help. You're in the same position you were before we met, but now it's my fault!"
## 
## There once was a man and a woman that both got in a terrible car wreck. Both of their vehicles were completely destroyed, buy fortunately, no one was hurt. In thankfulness, the woman said to the man, 'We are both okay, so we should celebrate. I have a bottle of wine in my car, let's open it.' So the woman got the bottleout of the car, and handed it to the man. The man took a really big drink, and handed the woman the bottle. The woman closed the bottle and put it down. The man asked, 'Aren't you going to take a drink?' The woman cleverly replied, 'No, I think I'll just wait for the cops to get here.'
## 
## What is the difference between men and women: A woman wants one man to satisfy her every need. A man wants every woman to satisfy his one need.
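The idea behind the k-nearest-neighbor lookup can be sketched in base R on any dist object. The helper knn_ids below is hypothetical and only mimics the shape of the id matrix used above; it is not the dbscan implementation.

```r
# For each item, return the indices of its k closest other items.
knn_ids <- function(d, k) {
  m <- as.matrix(d)
  diag(m) <- Inf   # an item is not its own neighbor
  ids <- vapply(seq_len(nrow(m)),
                function(i) order(m[i, ])[seq_len(k)],
                integer(k))
  t(matrix(ids, nrow = k))   # one row per item, one column per neighbor
}

# Four made-up points on a line.
toy_d <- stats::dist(c(0, 1, 3, 7))
toy_nn <- knn_ids(toy_d, k = 2)
print(toy_nn)
```

The last point (7) still gets two neighbors although both are far away. The same happens with the jokes above, where the later neighbors of joke 1 have little to do with doctors.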

3. Use Fixed-Radius Nearest Neighbors

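Recommend all jokes whose cosine distance to the given joke is within a given threshold (here 0.9). Unlike kNN, the number of recommendations varies from joke to joke and can be zero.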

nn <- frNN(d, .9) ### fixed radius nearest neighbors
cat(JesterJokes[nn$id[[1]]], sep = "\n\n")
## A man, recently completing a routine physical examination receives a phone call from his doctor. The doctor says, "I have some good news and some bad news." The man says, "OK, give me the good news first." The doctor says, "The good news is, you have 24 hours to live." The man replies, "Shit! That's the good news? Then what's the bad news?" The doctor says, "The bad news is, I forgot to call you yesterday."
## 
## A Czechoslovakian man felt his eyesight was growing steadily worse, and felt it was time to go see an optometrist. The doctor started with some simple testing, and showed him a standard eye chart with letters of diminishing size: CRKBNWXSKZY. . . "Can you read this?" the doctor asked. "Read it?" the Czech answered. "Doc, I know him!"
cat(JesterJokes[nn$id[[2]]], sep = "\n\n")
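A base R sketch of the fixed-radius idea, again with a hypothetical helper (fr_ids) and made-up points. Treating the radius as inclusive and returning neighbors in index order rather than sorted by distance are assumptions of this sketch, not statements about dbscan.

```r
# For each item, return the indices of all other items within distance eps.
fr_ids <- function(d, eps) {
  m <- as.matrix(d)
  diag(m) <- Inf   # an item is not its own neighbor
  lapply(seq_len(nrow(m)), function(i) unname(which(m[i, ] <= eps)))
}

# Four made-up points on a line.
toy_pts <- stats::dist(c(0, 1, 3, 7))
toy_fr <- fr_ids(toy_pts, eps = 2)
print(toy_fr)
```

The result is a list because the number of neighbors differs per item. The isolated point 7 gets an empty neighbor set, which for a recommender means that no joke is similar enough to recommend.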

Other options