Bag-of-Words and TF-IDF

The story so far…

Distances

  • We measure similarity between observations by calculating distances.

  • Euclidean distance: sum of squared differences, then square root

  • Manhattan distance: sum of absolute differences

  • In scikit-learn, use the pairwise_distances() function to get back a 2D numpy array of distances.

Scaling

  • It is important that all our features be on the same scale for distances to be meaningful.

  • Standardize: Subtract the mean (of the column) and divide by the standard deviation (of the column).

  • MinMax: Subtract the minimum value, divide by the range.

  • In scikit-learn, use the StandardScaler() or MinMaxScaler() functions.

  • Follow the specify - fit - transform code structure (a short sketch follows below).
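
As a quick refresher, here is a minimal sketch of that specify - fit - transform workflow followed by a distance calculation. The tiny DataFrame toy and its columns are invented just for illustration.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances

toy = pd.DataFrame({"height": [150, 160, 175], "weight": [50, 60, 80]})

scaler = StandardScaler()          # specify
scaler.fit(toy)                    # fit: learn each column's mean and standard deviation
toy_scaled = scaler.transform(toy) # transform: standardize each column

pairwise_distances(toy_scaled)     # 3x3 numpy array of Euclidean distances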

Bag of Words

Text Data

A textual data set consists of multiple texts. Each text is called a document. The collection of texts is called a corpus.

Example Corpus:

  1. "I am Sam\n\nI am Sam\nSam I..."

  2. "The sun did not shine.\nIt was..."

  3. "Fox\nSocks\nBox\nKnox\n\nKnox..."

  4. "Every Who\nDown in Whoville\n..."

  5. "UP PUP Pup is up.\nCUP PUP..."

  6. "On the fifteenth of May, in the..."

  7. "Congratulations!\nToday is your..."

  8. "One fish, two fish, red fish..."

Reading Text Data

Documents are usually stored in different files.

seuss_dir = "http://dlsun.github.io/pods/data/drseuss/"
seuss_files = [
    "green_eggs_and_ham.txt", 
    "cat_in_the_hat.txt",
    "fox_in_socks.txt", 
    "how_the_grinch_stole_christmas.txt",
    "hop_on_pop.txt", 
    "horton_hears_a_who.txt",
    "oh_the_places_youll_go.txt", 
    "one_fish_two_fish.txt"]

…so

…we have to read them in one by one

import requests

docs = {}
for filename in seuss_files:
    response = requests.get(seuss_dir + filename)
    docs[filename] = response.text


docs.keys()
dict_keys(['green_eggs_and_ham.txt', 'cat_in_the_hat.txt', 'fox_in_socks.txt', 'how_the_grinch_stole_christmas.txt', 'hop_on_pop.txt', 'horton_hears_a_who.txt', 'oh_the_places_youll_go.txt', 'one_fish_two_fish.txt'])

Bag-of-Words Representation

In the bag-of-words representation of this data, each row represents a document, each column represents a word, and each value is the number of times that word appears in that document.

First, we need to split each document into individual words.

docs["hop_on_pop.txt"].split()
['UP', 'PUP', 'Pup', 'is', 'up.', 'CUP', 'PUP', 'Pup', 'in', 'cup.', 'PUP', 'CUP', 'Cup', 'on', 'pup.', 'MOUSE', 'HOUSE', 'Mouse', 'on', 'house.', 'HOUSE', 'MOUSE', 'House', 'on', 'mouse.', 'ALL', 'TALL', 'We', 'all', 'are', 'tall.', 'ALL', 'SMALL', 'We', 'all', 'are', 'small.', 'ALL', 'BALL', 'We', 'all', 'play', 'ball.', 'BALL', 'WALL', 'Up', 'on', 'a', 'wall.', 'ALL', 'FALL', 'Fall', 'off', 'the', 'wall.', 'DAY', 'PLAY', 'We', 'play', 'all', 'day.', 'NIGHT', 'FIGHT', 'We', 'fight', 'all', 'night.HE', 'ME', 'He', 'is', 'after', 'me.', 'HIM', 'JIM', 'Jim', 'is', 'after', 'him.', 'SEE', 'BEE', 'We', 'see', 'a', 'bee.', 'SEE', 'BEE', 'THREE', 'Now', 'we', 'see', 'three.', 'THREE', 'TREE', 'Three', 'fish', 'in', 'a', 'tree.', 'Fish', 'in', 'a', 'tree?', 'How', 'can', 'that', 'be?', 'RED', 'RED', 'They', 'call', 'me', 'Red.', 'RED', 'BED', 'I', 'am', 'in', 'bed.', 'RED', 'NED', 'TED', 'and', 'ED', 'in', 'BED', 'PAT', 'PAT', 'they', 'call', 'him', 'Pat.', 'PAT', 'SAT', 'Pat', 'sat', 'on', 'hat.', 'PAT', 'CAT', 'Pat', 'sat', 'on', 'cat.', 'PAT', 'BAT', 'Pat', 'sat', 'on', 'bat.', 'NO', 'PAT', 'NO', 'Don’t', 'sit', 'on', 'that.', 'SAD', 'DAD', 'BAD', 'HAD', 'Dad', 'is', 'sad.', 'Very,', 'very', 'sad.', 'He', 'had', 'a', 'bad', 'day.', 'What', 'a', 'day', 'Dad', 'had!', 'THING', 'THING', 'What', 'is', 'that', 'thing?', 'THING', 'SING', 'That', 'thing', 'can', 'sing!', 'SONG', 'LONG', 'A', 'long,', 'long', 'song.', 'Good-by,', 'Thing.', 'You', 'sing', 'too', 'long.', 'WALK', 'WALK', 'We', 'like', 'to', 'walk.', 'WALK', 'TALK', 'We', 'like', 'to', 'talk.', 'HOP', 'POP', 'We', 'like', 'to', 'hop.', 'We', 'like', 'to', 'hop', 'on', 'top', 'of', 'Pop.', 'STOP', 'You', 'must', 'not', 'hop', 'on', 'Pop.', 'Mr.', 'BROWN', 'Mrs.', 'BROWN', 'Mr.', 'Brown', 'upside', 'down.', 'Pup', 'up.', 'Brown', 'down.', 'Pup', 'is', 'down.', 'Where', 'is', 'Brown?', 'WHERE', 'IS', 'BROWN?', 'THERE', 'IS', 'BROWN!', 'Mr.', 'Brown', 'is', 'out', 'of', 'town.', 'BACK', 'BLACK', 'Brown', 'came', 'back.', 'Brown', 'came', 'back', 'with', 'Mr.', 'Black.', 'SNACK', 'SNACK', 'Eat', 'a', 'snack.', 'Eat', 'a', 'snack', 'with', 'Brown', 'and', 'Black.', 'JUMP', 'BUMP', 'He', 'jumped.', 'He', 'bumped.', 'FAST', 'PAST', 'He', 'went', 'past', 'fast.', 'WENT', 'TENT', 'SENT', 'He', 'went', 'into', 'the', 'tent.', 'I', 'sent', 'him', 'out', 'of', 'the', 'tent.', 'WET', 'GET', 'Two', 'dogs', 'get', 'wet.', 'HELP', 'YELP', 'They', 'yelp', 'for', 'help.', 'HILL', 'WILL', 'Will', 'went', 'up', 'hill.', 'WILL', 'HILL', 'STILL', 'Will', 'is', 'up', 'hill', 'still.', 'FATHER', 'MOTHER', 'SISTER', 'BROTHER', 'That', 'one', 'is', 'my', 'other', 'brother.', 'My', 'brothers', 'read', 'a', 'little', 'bit.', 'Little', 'words', 'like', 'If', 'and', 'it.', 'My', 'father', 'can', 'read', 'big', 'words,', 'too.', 'Like', 'CONSTANTINOPLE', 'and', 'TIMBUKTU', 'SAY', 'SAY', 'What', 'does', 'this', 'say?', 'seehemewe', 'patpuppop', 'hethreetreebee', 'tophopstop', 'Ask', 'me', 'tomorrow', 'but', 'not', 'today.']

Then Count the Words

from collections import Counter
Counter(
  docs["hop_on_pop.txt"]
  .split()
  )
Counter({'is': 10, 'on': 10, 'We': 10, 'a': 9, 'He': 6, 'PAT': 6, 'Brown': 6, 'in': 5, 'all': 5, 'like': 5, 'Pup': 4, 'ALL': 4, 'RED': 4, 'and': 4, 'to': 4, 'Mr.': 4, 'PUP': 3, 'the': 3, 'can': 3, 'Pat': 3, 'sat': 3, 'What': 3, 'THING': 3, 'WALK': 3, 'of': 3, 'down.': 3, 'went': 3, 'up.': 2, 'CUP': 2, 'MOUSE': 2, 'HOUSE': 2, 'are': 2, 'BALL': 2, 'play': 2, 'wall.': 2, 'day.': 2, 'after': 2, 'SEE': 2, 'BEE': 2, 'see': 2, 'THREE': 2, 'that': 2, 'They': 2, 'call': 2, 'me': 2, 'BED': 2, 'I': 2, 'him': 2, 'NO': 2, 'Dad': 2, 'sad.': 2, 'That': 2, 'You': 2, 'hop': 2, 'Pop.': 2, 'not': 2, 'BROWN': 2, 'IS': 2, 'out': 2, 'came': 2, 'with': 2, 'Black.': 2, 'SNACK': 2, 'Eat': 2, 'tent.': 2, 'HILL': 2, 'WILL': 2, 'Will': 2, 'up': 2, 'My': 2, 'read': 2, 'SAY': 2, 'UP': 1, 'cup.': 1, 'Cup': 1, 'pup.': 1, 'Mouse': 1, 'house.': 1, 'House': 1, 'mouse.': 1, 'TALL': 1, 'tall.': 1, 'SMALL': 1, 'small.': 1, 'ball.': 1, 'WALL': 1, 'Up': 1, 'FALL': 1, 'Fall': 1, 'off': 1, 'DAY': 1, 'PLAY': 1, 'NIGHT': 1, 'FIGHT': 1, 'fight': 1, 'night.HE': 1, 'ME': 1, 'me.': 1, 'HIM': 1, 'JIM': 1, 'Jim': 1, 'him.': 1, 'bee.': 1, 'Now': 1, 'we': 1, 'three.': 1, 'TREE': 1, 'Three': 1, 'fish': 1, 'tree.': 1, 'Fish': 1, 'tree?': 1, 'How': 1, 'be?': 1, 'Red.': 1, 'am': 1, 'bed.': 1, 'NED': 1, 'TED': 1, 'ED': 1, 'they': 1, 'Pat.': 1, 'SAT': 1, 'hat.': 1, 'CAT': 1, 'cat.': 1, 'BAT': 1, 'bat.': 1, 'Don’t': 1, 'sit': 1, 'that.': 1, 'SAD': 1, 'DAD': 1, 'BAD': 1, 'HAD': 1, 'Very,': 1, 'very': 1, 'had': 1, 'bad': 1, 'day': 1, 'had!': 1, 'thing?': 1, 'SING': 1, 'thing': 1, 'sing!': 1, 'SONG': 1, 'LONG': 1, 'A': 1, 'long,': 1, 'long': 1, 'song.': 1, 'Good-by,': 1, 'Thing.': 1, 'sing': 1, 'too': 1, 'long.': 1, 'walk.': 1, 'TALK': 1, 'talk.': 1, 'HOP': 1, 'POP': 1, 'hop.': 1, 'top': 1, 'STOP': 1, 'must': 1, 'Mrs.': 1, 'upside': 1, 'Where': 1, 'Brown?': 1, 'WHERE': 1, 'BROWN?': 1, 'THERE': 1, 'BROWN!': 1, 'town.': 1, 'BACK': 1, 'BLACK': 1, 'back.': 1, 'back': 1, 'snack.': 1, 'snack': 1, 'JUMP': 1, 'BUMP': 1, 'jumped.': 1, 'bumped.': 1, 'FAST': 1, 'PAST': 1, 'past': 1, 'fast.': 1, 'WENT': 1, 'TENT': 1, 'SENT': 1, 'into': 1, 'sent': 1, 'WET': 1, 'GET': 1, 'Two': 1, 'dogs': 1, 'get': 1, 'wet.': 1, 'HELP': 1, 'YELP': 1, 'yelp': 1, 'for': 1, 'help.': 1, 'hill.': 1, 'STILL': 1, 'hill': 1, 'still.': 1, 'FATHER': 1, 'MOTHER': 1, 'SISTER': 1, 'BROTHER': 1, 'one': 1, 'my': 1, 'other': 1, 'brother.': 1, 'brothers': 1, 'little': 1, 'bit.': 1, 'Little': 1, 'words': 1, 'If': 1, 'it.': 1, 'father': 1, 'big': 1, 'words,': 1, 'too.': 1, 'Like': 1, 'CONSTANTINOPLE': 1, 'TIMBUKTU': 1, 'does': 1, 'this': 1, 'say?': 1, 'seehemewe': 1, 'patpuppop': 1, 'hethreetreebee': 1, 'tophopstop': 1, 'Ask': 1, 'tomorrow': 1, 'but': 1, 'today.': 1})

Bag-of-Words Representation

… then, we put these counts into a Series.

import pandas as pd

[
  pd.Series(Counter(doc.split()))
  for doc in docs.values()
]
[I           71
am           3
Sam          3
That         2
Sam-I-am     4
            ..
good         2
see!         1
So           1
Thank        2
you!         1
Length: 116, dtype: int64, The         4
sun         2
did         6
not        27
shine.      1
           ..
Now,        1
Well...     1
YOU         1
asked       1
YOU?        1
Length: 503, dtype: int64, Fox       6
Socks     4
Box       1
Knox      8
in       19
         ..
our       1
done,     1
Thank     1
lot       1
fun,      1
Length: 328, dtype: int64, Every        3
Who          9
Down         1
in          15
Whoville     4
            ..
light,       1
brought      1
he,          1
HIMSELF!     1
carved       1
Length: 623, dtype: int64, UP             1
PUP            3
Pup            4
is            10
up.            2
              ..
tophopstop     1
Ask            1
tomorrow       1
but            1
today.         1
Length: 241, dtype: int64, On              5
the            88
fifteenth       1
of             33
May,            1
               ..
summer.         1
rain            1
it's            1
fall-ish,       1
small-ish!"     1
Length: 918, dtype: int64, Congratulations!     1
Today                2
is                   7
your                19
day.                 1
                    ..
day!                 1
Your                 1
mountain             1
So...get             1
way!                 1
Length: 449, dtype: int64, One         1
fish,       7
two         2
red         1
blue        2
           ..
Tomorrow    1
another     1
one.        1
Every       1
there.      1
Length: 501, dtype: int64]

Create a DataFrame

… finally, we stack the Series into a DataFrame. This is called bag-of-words data.

pd.DataFrame(
    [pd.Series(
      Counter(doc.split())
      ) for doc in docs.values()],
    index = docs.keys()
    )
                                       I   am  Sam  ...  gone.  Tomorrow  one.
green_eggs_and_ham.txt              71.0  3.0  3.0  ...    NaN       NaN   NaN
cat_in_the_hat.txt                  48.0  NaN  NaN  ...    NaN       NaN   NaN
fox_in_socks.txt                     9.0  NaN  NaN  ...    NaN       NaN   NaN
how_the_grinch_stole_christmas.txt   6.0  NaN  NaN  ...    NaN       NaN   NaN
hop_on_pop.txt                       2.0  1.0  NaN  ...    NaN       NaN   NaN
horton_hears_a_who.txt              18.0  1.0  NaN  ...    NaN       NaN   NaN
oh_the_places_youll_go.txt           2.0  NaN  NaN  ...    NaN       NaN   NaN
one_fish_two_fish.txt               48.0  3.0  NaN  ...    1.0       1.0   1.0

[8 rows x 2562 columns]
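
One caveat: a word that never appears in a document shows up as NaN rather than 0. If we wanted to compute distances on this DataFrame directly, we would first fill those entries in with zero counts. A minimal sketch (the name bow is just for this example):

bow = pd.DataFrame(
    [pd.Series(Counter(doc.split())) for doc in docs.values()],
    index = docs.keys()
    ).fillna(0)   # a word that never appears in a document is a count of 0, not a missing value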

Bag-of-Words in Scikit-Learn

Alternatively, we can use CountVectorizer() in scikit-learn to produce a bag-of-words matrix.

from sklearn.feature_extraction.text import CountVectorizer

Specify

vec = CountVectorizer()

Fit

vec.fit(docs.values())
CountVectorizer()

Transform

vec.transform(docs.values())
<Compressed Sparse Row sparse matrix of dtype 'int64'
    with 2308 stored elements and shape (8, 1344)>
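
To inspect this sparse matrix, one option (a sketch; the name bow_sklearn is made up here) is to convert it to a DataFrame, taking the column names from the fitted vectorizer:

bow_sklearn = pd.DataFrame(
    vec.transform(docs.values()).todense(),
    columns = vec.get_feature_names_out(),
    index = docs.keys()
    )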

Entire Vocabulary

The set of words across a corpus is called the vocabulary. We can view the vocabulary in a fitted CountVectorizer() as follows:

vec.vocabulary_
{'am': 23, 'sam': 935, 'that': 1138, 'do': 287, 'not': 767, 'like': 644, 'you': 1336, 'green': 471, 'eggs': 326, 'and': 26, 'ham': 495, 'them': 1141, 'would': 1316, 'here': 526, 'or': 786, 'there': 1143, 'anywhere': 32, 'in': 576, 'house': 558, 'with': 1303, 'mouse': 722, 'eat': 323, 'box': 132, 'fox': 419, 'could': 242, 'car': 179, 'they': 1145, 'are': 35, 'may': 688, 'will': 1292, 'see': 953, 'tree': 1204, 'let': 635, 'me': 691, 'be': 62, 'mot': 718, 'train': 1202, 'on': 778, 'say': 944, 'the': 1139, 'dark': 265, 'rain': 884, 'goat': 453, 'boat': 118, 'so': 1035, 'try': 1213, 'if': 575, 'good': 459, 'thank': 1136, 'sun': 1107, 'did': 279, 'shine': 972, 'it': 586, 'was': 1255, 'too': 1188, 'wet': 1268, 'to': 1178, 'play': 836, 'we': 1261, 'sat': 940, 'all': 16, 'cold': 231, 'day': 270, 'sally': 934, 'two': 1220, 'said': 932, 'how': 560, 'wish': 1302, 'had': 488, 'something': 1042, 'go': 452, 'out': 789, 'ball': 50, 'nothing': 768, 'at': 43, 'sit': 1001, 'one': 780, 'little': 650, 'bit': 102, 'bump': 157, 'then': 1142, 'went': 1265, 'made': 673, 'us': 1231, 'jump': 594, 'looked': 660, 'saw': 943, 'him': 536, 'step': 1077, 'mat': 684, 'cat': 185, 'hat': 507, 'he': 513, 'why': 1285, 'know': 614, 'is': 583, 'sunny': 1108, 'but': 165, 'can': 176, 'have': 512, 'lots': 663, 'of': 773, 'fun': 434, 'funny': 435, 'some': 1039, 'games': 438, 'new': 747, 'tricks': 1207, 'lot': 662, 'show': 986, 'your': 1338, 'mother': 719, 'mind': 704, 'what': 1269, 'our': 788, 'for': 413, 'fish': 388, 'no': 754, 'make': 676, 'away': 44, 'tell': 1130, 'want': 1253, 'should': 980, 'about': 4, 'when': 1271, 'now': 769, 'fear': 368, 'my': 735, 'bad': 47, 'game': 437, 'call': 172, 'up': 1226, 'put': 869, 'down': 299, 'this': 1151, 'fall': 358, 'hold': 543, 'high': 532, 'as': 39, 'stand': 1064, 'book': 122, 'hand': 496, 'cup': 260, 'look': 659, 'cake': 171, 'top': 1191, 'books': 123, 'litte': 649, 'toy': 1200, 'ship': 973, 'milk': 702, 'dish': 284, 'hop': 550, 'oh': 775, 'these': 1144, 'rake': 885, 'man': 680, 'tail': 1117, 'red': 897, 'fan': 362, 'fell': 372, 'his': 538, 'head': 514, 'came': 175, 'from': 429, 'things': 1149, 'into': 582, 'pot': 853, 'lit': 648, 'sank': 937, 'deep': 275, 'shook': 979, 'bent': 85, 'get': 444, 'another': 27, 'ran': 887, 'fast': 364, 'back': 46, 'big': 93, 'wood': 1308, 'shut': 987, 'hook': 548, 'trick': 1206, 'take': 1118, 'got': 462, 'tip': 1175, 'bow': 131, 'pick': 823, 'thing': 1148, 'bite': 103, 'shake': 964, 'hands': 497, 'their': 1140, 'those': 1153, 'gave': 442, 'pat': 807, 'tame': 1125, 'come': 233, 'give': 448, 'fly': 405, 'kites': 611, 'hit': 540, 'run': 925, 'hall': 494, 'wall': 1251, 'thump': 1161, 'string': 1096, 'kite': 610, 'gown': 463, 'her': 525, 'dots': 297, 'pink': 829, 'white': 1277, 'bed': 67, 'bumps': 159, 'jumps': 596, 'kicks': 605, 'hops': 551, 'thumps': 1162, 'kinds': 609, 'way': 1260, 'she': 969, 'home': 545, 'hear': 517, 'find': 380, 'near': 738, 'think': 1150, 'rid': 903, 'after': 7, 'net': 745, 'bet': 87, 'yet': 1327, 'plop': 841, 'last': 625, 'thoe': 1152, 'stop': 1086, 'pack': 793, 'dear': 273, 'shame': 967, 'sad': 929, 'kind': 607, 'has': 505, 'gone': 457, 'yes': 1326, 'mess': 697, 'tall': 1124, 'ca': 168, 'who': 1279, 'always': 22, 'playthings': 838, 'were': 1266, 'picked': 824, 'strings': 1097, 'any': 29, 'well': 1264, 'asked': 41, 'socks': 1037, 'knox': 616, 'chicks': 203, 'bricks': 143, 'blocks': 112, 'clocks': 224, 'sir': 999, 'mr': 727, 'first': 387, 'll': 653, 'quick': 875, 'brick': 142, 'stack': 1063, 'block': 111, 'chick': 202, 'clock': 223, 'ticks': 
1165, 'tocks': 1180, 'tick': 1164, 'tock': 1179, 'six': 1003, 'sick': 988, 'please': 840, 'don': 293, 'tongue': 1187, 'isn': 585, 'slick': 1011, 'mixed': 709, 'sorry': 1048, 'an': 25, 'easy': 322, 'whose': 1283, 'sue': 1105, 'sews': 963, 'sees': 959, 'sew': 962, 'comes': 234, 'crow': 255, 'slow': 1014, 'joe': 590, 'clothes': 226, 'rose': 920, 'hose': 554, 'nose': 766, 'goes': 454, 'grows': 482, 'hate': 508, 'makes': 678, 'quite': 879, 'lame': 622, 'blue': 117, 'goo': 458, 'gooey': 460, 'gluey': 451, 'chewy': 201, 'chewing': 200, 'goose': 461, 'doing': 292, 'choose': 209, 'chew': 199, 'won': 1305, 'very': 1237, 'bim': 97, 'ben': 82, 'brings': 146, 'broom': 148, 'bends': 83, 'breaks': 140, 'band': 51, 'bands': 52, 'pig': 825, 'lead': 630, 'brooms': 149, 'bangs': 54, 'booms': 125, 'boom': 124, 'poor': 849, 'mouth': 724, 'much': 730, 'bring': 145, 'luke': 669, 'luck': 668, 'likes': 647, 'lakes': 621, 'duck': 309, 'licks': 637, 'takes': 1119, 'blab': 105, 'such': 1103, 'blibber': 110, 'blubber': 116, 'rubber': 923, 'dumb': 311, 'through': 1160, 'three': 1158, 'cheese': 198, 'trees': 1205, 'free': 420, 'fleas': 395, 'flew': 396, 'while': 1276, 'freezy': 422, 'breeze': 141, 'blew': 109, 'freeze': 421, 'sneeze': 1033, 'enough': 340, 'silly': 991, 'stuff': 1100, 'talk': 1121, 'tweetle': 1218, 'beetles': 72, 'fight': 377, 'called': 173, 'beetle': 71, 'battle': 59, 'puddle': 866, 'paddles': 798, 'paddle': 796, 'bottle': 127, 'muddle': 731, 'battles': 60, 'poodle': 846, 'eating': 324, 'noodles': 761, 'noodle': 760, 'wait': 1244, 'minute': 706, 'where': 1272, 'bottled': 128, 'paddled': 797, 'muddled': 732, 'duddled': 310, 'fuddled': 432, 'wuddled': 1319, 'done': 294, 'every': 346, 'whoville': 1284, 'liked': 645, 'christmas': 210, 'grinch': 473, 'lived': 652, 'just': 598, 'north': 765, 'hated': 509, 'whole': 1280, 'season': 952, 'ask': 40, 'knows': 615, 'reason': 896, 'wasn': 1256, 'screwed': 950, 'right': 905, 'perhaps': 817, 'shoes': 978, 'tight': 1168, 'most': 716, 'likely': 646, 'been': 69, 'heart': 519, 'sizes': 1005, 'small': 1019, 'whatever': 1270, 'stood': 1085, 'eve': 343, 'hating': 510, 'whos': 1282, 'staring': 1068, 'cave': 189, 'sour': 1053, 'grinchy': 475, 'frown': 430, 'warm': 1254, 'lighted': 643, 'windows': 1295, 'below': 81, 'town': 1198, 'knew': 612, 'beneath': 84, 'busy': 164, 'hanging': 499, 'mistletoe': 707, 'wreath': 1318, 're': 892, 'stockings': 1082, 'snarled': 1027, 'sneer': 1031, 'tomorrow': 1186, 'practically': 856, 'growled': 481, 'fingers': 384, 'nervously': 744, 'drumming': 307, 'must': 734, 'coming': 235, 'girls': 447, 'boys': 134, 'wake': 1246, 'bright': 144, 'early': 318, 'rush': 926, 'toys': 1201, 'noise': 755, 'young': 1337, 'old': 777, 'feast': 369, 'pudding': 865, 'rare': 889, 'roast': 912, 'beast': 64, 'which': 1275, 'couldn': 243, 'least': 632, 'close': 225, 'together': 1183, 'bells': 80, 'ringing': 907, 'start': 1069, 'singing': 997, 'sing': 996, 'more': 714, 'thought': 1155, 'fifty': 376, 'years': 1322, 've': 1236, 'idea': 574, 'awful': 45, 'wonderful': 1306, 'laughed': 627, 'throat': 1159, 'santy': 939, 'claus': 216, 'coat': 230, 'chuckled': 211, 'clucked': 229, 'great': 467, 'saint': 933, 'nick': 750, 'need': 742, 'reindeer': 898, 'around': 38, 'since': 994, 'scarce': 945, 'none': 757, 'found': 417, 'simply': 993, 'instead': 581, 'dog': 290, 'max': 687, 'took': 1189, 'thread': 1157, 'tied': 1167, 'horn': 552, 'loaded': 655, 'bags': 48, 'empty': 335, 'sacks': 928, 'ramshackle': 886, 'sleigh': 1010, 'hitched': 541, 'giddap': 446, 'started': 1070, 'toward': 1195, 
'homes': 546, 'lay': 629, 'asnooze': 42, 'quiet': 877, 'snow': 1034, 'filled': 378, 'air': 12, 'dreaming': 301, 'sweet': 1112, 'dreams': 302, 'without': 1304, 'care': 180, 'square': 1062, 'number': 771, 'hissed': 539, 'climbed': 222, 'roof': 916, 'fist': 389, 'slid': 1012, 'chimney': 206, 'rather': 890, 'pinch': 828, 'santa': 938, 'stuck': 1099, 'only': 781, 'once': 779, 'moment': 710, 'fireplace': 386, 'flue': 403, 'hung': 570, 'row': 922, 'grinned': 477, 'slithered': 1013, 'slunk': 1017, 'smile': 1023, 'unpleasant': 1225, 'room': 917, 'present': 857, 'pop': 850, 'guns': 485, 'bicycles': 92, 'roller': 915, 'skates': 1006, 'drums': 308, 'checkerboards': 197, 'tricycles': 1208, 'popcorn': 851, 'plums': 843, 'stuffed': 1101, 'nimbly': 752, 'by': 167, 'icebox': 573, 'cleaned': 218, 'flash': 394, 'even': 344, 'hash': 506, 'food': 408, 'glee': 450, 'grabbed': 466, 'shove': 985, 'heard': 518, 'sound': 1051, 'coo': 239, 'dove': 298, 'turned': 1215, 'cindy': 213, 'lou': 664, 'than': 1135, 'caught': 187, 'tiny': 1174, 'daughter': 268, 'water': 1258, 'stared': 1067, 'taking': 1120, 'smart': 1021, 'lie': 638, 'tot': 1194, 'fake': 357, 'lied': 639, 'light': 642, 'side': 989, 'workshop': 1312, 'fix': 391, 'fib': 374, 'fooled': 410, 'child': 204, 'patted': 810, 'drink': 303, 'sent': 960, 'log': 656, 'fire': 385, 'himself': 537, 'liar': 636, 'walls': 1252, 'left': 634, 'hooks': 549, 'wire': 1301, 'speck': 1056, 'crumb': 256, 'same': 936, 'other': 787, 'houses': 559, 'leaving': 633, 'crumbs': 257, 'mouses': 723, 'quarter': 873, 'past': 806, 'dawn': 269, 'still': 1081, 'packed': 795, 'sled': 1008, 'presents': 858, 'ribbons': 902, 'wrappings': 1317, 'tags': 1116, 'tinsel': 1173, 'trimmings': 1209, 'trappings': 1203, 'thousand': 1156, 'feet': 371, 'mt': 729, 'crumpit': 258, 'rode': 914, 'load': 654, 'tiptop': 1176, 'dump': 312, 'pooh': 847, 'grinchishly': 474, 'humming': 565, 'finding': 381, 'waking': 1247, 'mouths': 725, 'hang': 498, 'open': 784, 'cry': 259, 'boo': 121, 'hoo': 547, 'paused': 811, 'ear': 317, 'rising': 908, 'over': 790, 'low': 667, 'grow': 480, 'sounded': 1052, 'merry': 696, 'popped': 852, 'eyes': 352, 'shocking': 976, 'surprise': 1111, 'hadn': 489, 'stopped': 1087, 'somehow': 1040, 'ice': 572, 'puzzling': 872, 'packages': 794, 'boxes': 133, 'puzzled': 870, 'hours': 557, 'till': 1169, 'puzzler': 871, 'sore': 1047, 'before': 74, 'maybe': 689, 'doesn': 289, 'store': 1088, 'means': 693, 'happened': 501, 'grew': 472, 'didn': 280, 'feel': 370, 'whizzed': 1278, 'morning': 715, 'brought': 152, 'carved': 183, 'pup': 868, 'off': 774, 'night': 751, 'jim': 588, 'bee': 68, 'ned': 741, 'ted': 1128, 'ed': 325, 'bat': 57, 'dad': 263, 'song': 1045, 'long': 658, 'walk': 1248, 'brown': 153, 'mrs': 728, 'upside': 1230, 'black': 106, 'snack': 1025, 'jumped': 595, 'bumped': 158, 'tent': 1132, 'dogs': 291, 'help': 523, 'yelp': 1325, 'hill': 534, 'father': 366, 'sister': 1000, 'brother': 150, 'brothers': 151, 'read': 893, 'words': 1309, 'constantinople': 238, 'timbuktu': 1170, 'does': 288, 'seehemewe': 954, 'patpuppop': 809, 'hethreetreebee': 527, 'tophopstop': 1192, 'today': 1181, 'fifteenth': 375, 'jungle': 597, 'nool': 763, 'heat': 520, 'cool': 241, 'pool': 848, 'splashing': 1058, 'enjoying': 339, 'joys': 592, 'horton': 553, 'elephant': 331, 'towards': 1196, 'again': 9, 'faint': 355, 'person': 818, 'calling': 174, 'dust': 314, 'blowing': 115, 'though': 1154, 'murmured': 733, 'never': 746, 'able': 3, 'yell': 1323, 'someone': 1041, 'sort': 1049, 'creature': 251, 'size': 1004, 'seen': 958, 'shaking': 965, 'blow': 
114, 'steer': 1076, 'save': 941, 'because': 66, 'matter': 685, 'gently': 443, 'using': 1233, 'greatest': 469, 'stretched': 1095, 'trunk': 1212, 'lifted': 641, 'carried': 181, 'placed': 832, 'safe': 931, 'soft': 1038, 'clover': 227, 'humpf': 567, 'humpfed': 568, 'voice': 1242, 'twas': 1217, 'kangaroo': 599, 'pouch': 855, 'pin': 827, 'believe': 77, 'sincerely': 995, 'ears': 319, 'keen': 601, 'clearly': 221, 'four': 418, 'family': 360, 'children': 205, 'starting': 1071, 'favour': 367, 'disturb': 286, 'fool': 409, 'biggest': 95, 'blame': 107, 'kangaroos': 600, 'plunged': 844, 'terrible': 1133, 'frowned': 431, 'persons': 819, 'drowned': 306, 'protect': 861, 'bigger': 94, 'plucked': 842, 'hustled': 571, 'tops': 1193, 'news': 748, 'quickly': 876, 'spread': 1061, 'talks': 1123, 'flower': 402, 'walked': 1249, 'worrying': 1315, 'almost': 18, 'hour': 556, 'alarm': 13, 'harm': 504, 'walking': 1250, 'talking': 1122, 'barely': 56, 'speak': 1055, 'friend': 425, 'fine': 382, 'helped': 524, 'folks': 407, 'end': 336, 'saved': 942, 'ceilings': 190, 'floors': 401, 'churches': 212, 'grocery': 479, 'stores': 1089, 'mean': 692, 'gasped': 441, 'buildings': 156, 'piped': 830, 'certainly': 191, 'mayor': 690, 'friendly': 426, 'clean': 217, 'seem': 956, 'terribly': 1134, 'aren': 36, 'wonderfully': 1307, 'ville': 1239, 'thankful': 1137, 'greatful': 470, 'worry': 1314, 'spoke': 1059, 'monkeys': 711, 'neck': 739, 'wickersham': 1286, 'shouting': 984, 'rot': 921, 'elephants': 332, 'going': 455, 'nonsense': 758, 'snatched': 1028, 'bottomed': 129, 'eagle': 316, 'named': 737, 'valad': 1234, 'vlad': 1241, 'koff': 617, 'mighty': 699, 'strong': 1098, 'swift': 1113, 'wing': 1296, 'kindly': 608, 'beak': 63, 'late': 626, 'afternoon': 8, 'far': 363, 'bird': 99, 'flapped': 392, 'wings': 1297, 'flight': 398, 'chased': 195, 'groans': 478, 'stones': 1084, 'tattered': 1126, 'toenails': 1182, 'battered': 58, 'bones': 120, 'begged': 75, 'live': 651, 'folk': 406, 'beyond': 90, 'kept': 602, 'flapping': 393, 'shoulder': 981, 'quit': 878, 'yapping': 1321, 'hide': 531, '56': 1, 'next': 749, 'sure': 1109, 'place': 831, 'hid': 529, 'drop': 304, 'somewhere': 1044, 'inside': 579, 'patch': 808, 'clovers': 228, 'hundred': 569, 'miles': 701, 'wide': 1288, 'sneered': 1032, 'fail': 354, 'flip': 399, 'cried': 253, 'bust': 163, 'shall': 966, 'friends': 427, 'searched': 951, 'sought': 1050, 'noon': 764, 'dead': 272, 'alive': 15, 'piled': 826, 'nine': 753, 'five': 390, 'millionth': 703, 'really': 895, 'trouble': 1210, 'share': 968, 'birdie': 100, 'dropped': 305, 'landed': 623, 'hard': 503, 'tea': 1127, 'pots': 854, 'broken': 147, 'rocking': 913, 'chairs': 192, 'smashed': 1022, 'bicycle': 91, 'tires': 1177, 'crashed': 250, 'pleaded': 839, 'stick': 1079, 'making': 679, 'repairs': 900, 'course': 246, 'answered': 28, 'thin': 1147, 'thick': 1146, 'days': 271, 'wild': 1291, 'insisted': 580, 'chatting': 196, 'existed': 350, 'carryings': 182, 'peaceable': 812, 'bellowing': 79, 'bungle': 160, 'state': 1072, 'snapped': 1026, 'nonsensical': 759, 'dozens': 300, 'uncles': 1223, 'wickershams': 1287, 'cousins': 247, 'laws': 628, 'engaged': 338, 'roped': 919, 'caged': 170, 'hah': 490, 'boil': 119, 'hot': 555, 'steaming': 1075, 'kettle': 603, 'beezle': 73, 'nut': 772, 'oil': 776, 'full': 433, 'prove': 862, 'meeting': 695, 'everyone': 347, 'holler': 544, 'shout': 982, 'scream': 949, 'stew': 1078, 'scared': 947, 'people': 814, 'loudly': 666, 'smiled': 1024, 'clear': 219, 'bell': 78, 'surely': 1110, 'wind': 1294, 'distant': 285, 'voices': 1243, 'either': 329, 'neither': 743, 
'grab': 465, 'shouted': 983, 'cage': 169, 'dope': 296, 'lasso': 624, 'stomach': 1083, 'ten': 1131, 'rope': 918, 'tie': 1166, 'knots': 613, 'lose': 661, 'dunk': 313, 'juice': 593, 'fought': 415, 'vigor': 1238, 'vim': 1240, 'gang': 439, 'many': 682, 'beat': 65, 'mauled': 686, 'haul': 511, 'managed': 681, 'die': 281, 'yourselves': 1340, 'tom': 1185, 'smack': 1018, 'whooped': 1281, 'racked': 882, 'rattled': 891, 'kettles': 604, 'brass': 137, 'pans': 802, 'garbage': 440, 'pail': 800, 'cranberry': 249, 'cans': 178, 'bazooka': 61, 'blasted': 108, 'toots': 1190, 'clarinets': 214, 'oom': 783, 'pahs': 799, 'flutes': 404, 'gusts': 486, 'loud': 665, 'racket': 883, 'rang': 888, 'sky': 1007, 'howling': 562, 'mad': 672, 'hullabaloo': 564, 'hey': 528, 'hows': 563, 'mine': 705, 'best': 86, 'working': 1311, 'anyone': 30, 'shirking': 975, 'rushed': 927, 'east': 321, 'west': 1267, 'seemed': 957, 'yipping': 1331, 'beeping': 70, 'bipping': 98, 'ruckus': 924, 'roar': 911, 'raced': 881, 'each': 315, 'building': 155, 'floor': 400, 'felt': 373, 'getting': 445, 'nowhere': 770, 'despair': 277, 'suddenly': 1104, 'burst': 161, 'door': 295, 'discovered': 283, 'shirker': 974, 'hidden': 530, 'fairfax': 356, 'apartments': 34, 'apartment': 33, '12': 0, 'jo': 589, 'standing': 1065, 'bouncing': 130, 'yo': 1332, 'yipp': 1330, 'chirp': 208, 'twerp': 1219, 'lad': 619, 'eiffelberg': 327, 'tower': 1197, 'towns': 1199, 'darkest': 267, 'time': 1171, 'blood': 113, 'aid': 11, 'country': 244, 'noises': 756, 'greater': 468, 'amounts': 24, 'counts': 245, 'thus': 1163, 'cleared': 220, 'yopp': 1335, 'extra': 351, 'finally': 379, 'proved': 863, 'world': 1313, 'smallest': 1020, 'true': 1211, 'planning': 835, 'summer': 1106, 'ish': 584, 'congratulations': 237, 'places': 833, 'brains': 135, 'yourself': 1339, 'direction': 282, 'own': 791, 'guy': 487, 'decide': 274, 'streets': 1094, 'em': 334, 'street': 1093, 'case': 184, 'straight': 1091, 'opener': 785, 'happen': 500, 'frequently': 423, 'brainy': 136, 'footsy': 412, 'along': 20, 'happening': 502, 'seeing': 955, 'sights': 990, 'join': 591, 'fliers': 397, 'soar': 1036, 'heights': 521, 'lag': 620, 'behind': 76, 'speed': 1057, 'pass': 805, 'soon': 1046, 'wherever': 1273, 'rest': 901, 'except': 349, 'sometimes': 1043, 'sadly': 930, 'bang': 53, 'ups': 1229, 'prickle': 859, 'ly': 671, 'perch': 816, 'lurch': 670, 'chances': 194, 'slump': 1015, 'un': 1221, 'slumping': 1016, 'easily': 320, 'marked': 683, 'mostly': 717, 'darked': 266, 'sprain': 1060, 'both': 126, 'elbow': 330, 'chin': 207, 'dare': 264, 'stay': 1073, 'win': 1293, 'turn': 1214, 'quarters': 874, 'sneak': 1029, 'simple': 992, 'afraid': 6, 'maker': 677, 'upper': 1228, 'confused': 236, 'race': 880, 'wiggled': 1290, 'roads': 910, 'break': 139, 'necking': 740, 'pace': 792, 'grind': 476, 'cross': 254, 'weirdish': 1263, 'space': 1054, 'headed': 515, 'useless': 1232, 'waiting': 1245, 'bus': 162, 'plane': 834, 'mail': 675, 'phone': 822, 'ring': 906, 'hair': 491, 'friday': 424, 'uncle': 1222, 'jake': 587, 'better': 88, 'pearls': 813, 'pair': 801, 'pants': 803, 'wig': 1289, 'curls': 261, 'chance': 193, 'escape': 341, 'staying': 1074, 'playing': 837, 'banner': 55, 'ride': 904, 'ready': 894, 'anything': 31, 'under': 1224, 'points': 845, 'scored': 948, 'magical': 674, 'winning': 1300, 'est': 342, 'winner': 1299, 'fame': 359, 'famous': 361, 'watching': 1257, 'tv': 1216, 'times': 1172, 'lonely': 657, 'cause': 188, 'against': 10, 'alone': 19, 'whether': 1274, 'meet': 694, 'scare': 946, 'road': 909, 'between': 89, 'hither': 542, 'yon': 1333, 'weather': 1262, 
'foul': 416, 'enemies': 337, 'prowl': 864, 'hakken': 493, 'kraks': 618, 'howl': 561, 'onward': 782, 'frightening': 428, 'creek': 252, 'arms': 37, 'sneakers': 1030, 'leak': 631, 'hike': 533, 'face': 353, 'problems': 860, 'already': 21, 'strange': 1092, 'birds': 101, 'tact': 1115, 'remember': 899, 'life': 640, 'balancing': 49, 'act': 5, 'forget': 414, 'dexterous': 278, 'deft': 276, 'mix': 708, 'foot': 411, 'succeed': 1102, 'indeed': 577, '98': 2, 'percent': 815, 'guaranteed': 483, 'kid': 606, 'move': 726, 'mountains': 721, 'name': 736, 'buxbaum': 166, 'bixby': 104, 'bray': 138, 'mordecai': 713, 'ali': 14, 'van': 1235, 'allen': 17, 'shea': 970, 'mountain': 720, 'star': 1066, 'glad': 449, 'fat': 365, 'yellow': 1324, 'everywhere': 348, 'seven': 961, 'eight': 328, 'eleven': 333, 'ever': 345, 'wump': 1320, 'hump': 566, 'gump': 484, 'pull': 867, 'sticks': 1080, 'bike': 96, 'mike': 700, 'sits': 1002, 'work': 1310, 'hills': 535, 'hello': 522, 'cow': 248, 'cannot': 177, 'teeth': 1129, 'gold': 456, 'shoe': 977, 'story': 1090, 'told': 1184, 'nook': 762, 'cook': 240, 'moon': 712, 'sheep': 971, 'sleep': 1009, 'zans': 1341, 'gox': 464, 'ying': 1328, 'sings': 998, 'yink': 1329, 'wink': 1298, 'ink': 578, 'yop': 1334, 'finger': 383, 'brush': 154, 'comb': 232, 'pet': 820, 'met': 698, 'cats': 186, 'cut': 262, 'pets': 821, 'zeds': 1342, 'upon': 1227, 'heads': 516, 'haircut': 492, 'wave': 1259, 'swish': 1114, 'gack': 436, 'park': 804, 'clark': 215, 'zeep': 1343}

Specific Words

We can look up specific words in the vocabulary as follows:

vec.vocabulary_["fish"]
388
vec.vocabulary_["pop"]
850
vec.vocabulary_["eggs"]
326
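
Keep in mind that these numbers are column positions in the bag-of-words matrix, not counts. For example, to pull out the counts of “fish” for each document, we could index into the transformed matrix (a sketch; the name counts is made up here):

counts = vec.transform(docs.values())
counts[:, vec.vocabulary_["fish"]].toarray()  # one count of "fish" per document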

Text Normalizing

What’s wrong with the way we counted words originally?

Counter({'UP': 1, 'PUP': 3, 'Pup': 4, 'is': 10, 'up.': 2, ...})
  • It’s usually good to normalize for punctuation and capitalization.

  • Normalization options are specified when you initialize the CountVectorizer().

  • By default, scikit-learn strips punctuation and converts all characters to lowercase.

Text Normalizing in sklearn

If you don’t want scikit-learn to normalize for punctuation and capitalization, you can do the following:

vec = CountVectorizer(lowercase = False, token_pattern = r"[\S]+")
vec.fit(docs.values())
CountVectorizer(lowercase=False, token_pattern='[\\S]+')
vec.transform(docs.values())
<Compressed Sparse Row sparse matrix of dtype 'int64'
    with 3679 stored elements and shape (8, 2562)>

CountVectorizer()

Setting lowercase = False tells scikit-learn not to convert every word to lowercase. Setting token_pattern = r"[\S]+" declares a regular expression that treats every run of non-whitespace characters ([\S]+) as a single token, so punctuation stays attached to the words.
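
To see the default normalization in action, we can build the vectorizer's analyzer and apply it to a snippet (a small sketch using a line from Hop on Pop):

analyzer = CountVectorizer().build_analyzer()
analyzer("UP PUP Pup is up.")
# ['up', 'pup', 'pup', 'is', 'up'] -- lowercased, punctuation stripped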

N-grams

The Shortcomings of Bag-of-Words

Bag-of-words is easy to understand and easy to implement. What are its disadvantages?

Consider the following documents:

  1. “The dog bit her owner.”

  2. “Her dog bit the owner.”

Both documents have the same exact bag-of-words representation, but they mean something quite different!

N-grams

  • An n-gram is a sequence of \(n\) words.

  • N-grams allow us to capture more of the meaning.

  • For example, if we count bigrams (2-grams) instead of words, we can distinguish the two documents from before:

  1. “The dog bit her owner.”

  2. “Her dog bit the owner.”

\[\begin{array}{l|ccccccc} & \text{the, dog} & \text{her, dog} & \text{dog, bit} & \text{bit, the} & \text{bit, her} & \text{the, owner} & \text{her, owner} \\ \hline \text{1} & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ \text{2} & 0 & 1 & 1 & 1 & 0 & 1 & 0 \\ \end{array}\]
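
We can reproduce this table in scikit-learn (a small sketch; the names pair and vec2 are made up for the example):

pair = ["The dog bit her owner.", "Her dog bit the owner."]
vec2 = CountVectorizer(ngram_range = (2, 2))
vec2.fit(pair)
vec2.get_feature_names_out()
# ['bit her' 'bit the' 'dog bit' 'her dog' 'her owner' 'the dog' 'the owner']
vec2.transform(pair).toarray()
# one row of bigram counts per sentence, with columns in the alphabetical order above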

N-grams in scikit-learn

We can easily modify our previous approach by specifying the ngram_range argument in CountVectorizer(). To get bigrams, we set the range to (2, 2).

Specify

vec = CountVectorizer(ngram_range = (2, 2))

Fit

vec.fit(docs.values())
CountVectorizer(ngram_range=(2, 2))

Transform

vec.transform(docs.values())
<Compressed Sparse Row sparse matrix of dtype 'int64'
    with 6459 stored elements and shape (8, 5846)>

N-grams in scikit-learn

… or we can also get individual words (unigrams) alongside the bigrams:

Specify

vec = CountVectorizer(ngram_range = (1, 2))

Fit

vec.fit(docs.values())
CountVectorizer(ngram_range=(1, 2))

Transform

vec.transform(docs.values())
<Compressed Sparse Row sparse matrix of dtype 'int64'
    with 8767 stored elements and shape (8, 7190)>

Text Data and Distances

Similar Documents

Now, we can use this bag-of-words data to measure similarities between documents!

from sklearn.metrics import pairwise_distances

dat = vec.transform(docs.values())

dists = pairwise_distances(dat)
dists
array([[  0.        , 204.84628383, 178.52730884, 219.79535937,
        168.42802617, 228.52570971, 183.5973856 , 178.83511959],
       [204.84628383,   0.        , 189.45711916, 157.15597348,
        186.2793601 , 152.32859219, 175.28262892, 156.12815249],
       [178.52730884, 189.45711916,   0.        , 171.66828478,
         95.21554495, 189.95262567, 141.52031656, 130.11533345],
       [219.79535937, 157.15597348, 171.66828478,   0.        ,
        163.84138671, 138.97481786, 174.56230979, 162.59766296],
       [168.42802617, 186.2793601 ,  95.21554495, 163.84138671,
          0.        , 188.92855793, 133.1990991 , 112.89818422],
       [228.52570971, 152.32859219, 189.95262567, 138.97481786,
        188.92855793,   0.        , 162.83120094, 164.95453919],
       [183.5973856 , 175.28262892, 141.52031656, 174.56230979,
        133.1990991 , 162.83120094,   0.        , 134.98888843],
       [178.83511959, 156.12815249, 130.11533345, 162.59766296,
        112.89818422, 164.95453919, 134.98888843,   0.        ]])

Similar Documents

dists[0].argsort()
array([0, 4, 2, 7, 6, 1, 3, 5])
docs.keys()
dict_keys(['green_eggs_and_ham.txt', 'cat_in_the_hat.txt', 'fox_in_socks.txt', 'how_the_grinch_stole_christmas.txt', 'hop_on_pop.txt', 'horton_hears_a_who.txt', 'oh_the_places_youll_go.txt', 'one_fish_two_fish.txt'])
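
To turn those row positions back into titles, we can reorder the file names by the sorted indices (a sketch, assuming numpy is imported as np):

import numpy as np

np.array(list(docs.keys()))[dists[0].argsort()]
# the first entry is green_eggs_and_ham.txt itself (distance 0); the rest are its nearest neighbors in order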

Tip

This is how data scientists do authorship identification!

Lecture Activity 4.2 - Part 1

One Fish Two Fish

Using unigrams, bigrams, and trigrams, which Dr. Seuss document is closest to “One Fish Two Fish”?

Motivating Example

Issues with the Distance Approach

BUT WAIT!

  • Don’t we care more about word choice than total words used?

  • Wouldn’t a longer document have more words, and thus be able to “match” other documents?

  • Wouldn’t more common words appear in more documents, and thus cause them to “match”?

  • Recall: We have many options for scaling.

  • Recall: We have many options for distance metrics.

Example

Document A:

“Whoever has hate for his brother is in the darkness and walks in the darkness.”

Document B:

“Hello darkness, my old friend, I’ve come to talk with you again.”

Document C:

“Returning hate for hate multiplies hate, adding deeper darkness to a night already devoid of stars. Darkness cannot drive out darkness; only light can do that.”

Document D:

“Happiness can be found in the darkest of times, if only one remembers to turn on the light.”

Example with Code

Code
documents = [
    "whoever has hate for his brother is in the darkness and walks in the darkness",
    "hello darkness my old friend",
    "returning hate for hate multiplies hate adding deeper darkness to a night already devoid of stars darkness cannot drive out darkness only light can do that",
    "happiness can be found in the darkest of times if only one remembers to turn on the light"
]
Code
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(token_pattern = r"\w+")
vec.fit(documents)
CountVectorizer(token_pattern='\\w+')
Code
bow_matrix = vec.transform(documents)
bow_matrix
<Compressed Sparse Row sparse matrix of dtype 'int64'
    with 56 stored elements and shape (4, 45)>

Example Output

bow_dataframe = pd.DataFrame(
  bow_matrix.todense(), 
  columns = vec.get_feature_names_out()
  )
bow_dataframe[["darkness", "light"]]
   darkness  light
0         2      0
1         1      0
2         3      1
3         0      1

Measuring Similarity

Code
from sklearn.metrics import pairwise_distances

dists = pairwise_distances(bow_matrix)
dists[0].argsort()
array([0, 1, 3, 2])

“Whoever has hate for his brother is in the darkness and walks in the darkness.”

“Hello darkness, my old friend, I’ve come to talk with you again.”

“Returning hate for hate multiplies hate, adding deeper darkness to a night already devoid of stars. Darkness cannot drive out darkness; only light can do that.”

“Happiness can be found in the darkest of times, if only one remembers to turn on the light.”

Cosine Distance

Choosing Your Distance Metric

Is Euclidean distance really the best choice?!

My name is James Bond, James Bond is my name.

My name is James Bond.

My name is James.

  • If we count words, the second and third documents will be the most similar.

  • The first document is longer, so it has “double” counts.

  • But the first document uses exactly the same words as the second, just twice as often!

  • Solution: cosine distance (on board; a small sketch follows below)
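
Here is a small sketch of that comparison on the three sentences above (the names bond_docs and bond_bow are made up for the example):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_distances

bond_docs = [
    "My name is James Bond, James Bond is my name.",
    "My name is James Bond.",
    "My name is James.",
]
bond_bow = CountVectorizer().fit_transform(bond_docs)

pairwise_distances(bond_bow).round(2)  # Euclidean: the two short sentences end up closest
cosine_distances(bond_bow).round(2)    # cosine: the first two sentences are at distance 0 -- identical word choice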

Cosine Distance

As a rule, cosine distance is a better choice for bag-of-words data!

from sklearn.metrics.pairwise import cosine_distances

dists = cosine_distances(bow_matrix)
dists[0].argsort()
array([0, 2, 3, 1])

“Whoever has hate for his brother is in the darkness and walks in the darkness.”

“Hello darkness, my old friend, I’ve come to talk with you again.”

“Returning hate for hate multiplies hate, adding deeper darkness to a night already devoid of stars. Darkness cannot drive out darkness; only light can do that.”

“Happiness can be found in the darkest of times, if only one remembers to turn on the light.”

TF-IDF

Measuring Similarity

Which of these seems most important for measuring similarity?

  • Documents B, C, and D all have the word “to”.

  • Documents A, B, and C all have the word “darkness”.

  • Documents A and C both have the word “hate”.

  • Documents C and D both have the word “light”.

TF-IDF Scaling

We would like to scale our word counts by the document length (TF).


We would also like to scale our word counts by the number of documents they appear in (IDF).

Document Lengths

If a document is longer, it is more likely to share words with other documents.

bow_totals = bow_dataframe.sum(axis = 1)
bow_totals
0    15
1     5
2    26
3    18
dtype: int64

Term Frequencies (TF)

Let’s use frequencies instead of counts.

bow_tf = bow_dataframe.divide(bow_totals, axis = 0)
bow_tf
          a    adding   already  ...      turn     walks   whoever
0  0.000000  0.000000  0.000000  ...  0.000000  0.066667  0.066667
1  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000
2  0.038462  0.038462  0.038462  ...  0.000000  0.000000  0.000000
3  0.000000  0.000000  0.000000  ...  0.055556  0.000000  0.000000

[4 rows x 45 columns]
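
As a quick check (a sketch), each row of term frequencies should now sum to 1:

bow_tf.sum(axis = 1)
# every document's word frequencies add up to 1.0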

Distance of Term Frequencies (TF)

dists = cosine_distances(bow_tf)
dists[0].argsort()
array([0, 2, 3, 1])

“Whoever has hate for his brother is in the darkness and walks in the darkness.”

“Hello darkness, my old friend, I’ve come to talk with you again.”

“Returning hate for hate multiplies hate, adding deeper darkness to a night already devoid of stars. Darkness cannot drive out darkness; only light can do that.”

“Happiness can be found in the darkest of times, if only one remembers to turn on the light.”

Inverse Document Frequency (IDF)

  • In principle, if two documents share rarer words, they are more similar.

  • What matters is not overall word frequency but how many of the documents have that word.

bow_dataframe
   a  adding  already  and  be  brother  ...  the  times  to  turn  walks  whoever
0  0       0        0    1   0        1  ...    2      0   0     0      1        1
1  0       0        0    0   0        0  ...    0      0   0     0      0        0
2  1       1        1    0   0        0  ...    0      0   1     0      0        0
3  0       0        0    0   1        0  ...    2      1   1     1      0        0

[4 rows x 45 columns]

IDF - Step 1

First, identify which words occur in each document.

has_word = (bow_dataframe > 0)
has_word[["darkness", "light", "hate"]]
   darkness  light   hate
0      True  False   True
1      True  False  False
2      True   True   True
3     False   True  False

IDF - Step 2

Then, let’s calculate the fraction of the four documents in which each word occurs (its document frequency).

bow_df = (
  has_word
  .sum(axis = 0) / 4
  )
bow_df
a             0.25
adding        0.25
already       0.25
and           0.25
be            0.25
brother       0.25
can           0.50
cannot        0.25
darkest       0.25
darkness      0.75
deeper        0.25
devoid        0.25
do            0.25
drive         0.25
for           0.50
found         0.25
friend        0.25
happiness     0.25
has           0.25
hate          0.50
hello         0.25
his           0.25
if            0.25
in            0.50
is            0.25
light         0.50
multiplies    0.25
my            0.25
night         0.25
of            0.50
old           0.25
on            0.25
one           0.25
only          0.50
out           0.25
remembers     0.25
returning     0.25
stars         0.25
that          0.25
the           0.50
times         0.25
to            0.50
turn          0.25
walks         0.25
whoever       0.25
dtype: float64

axis = 0

What values are we summing?

IDF - Step 3

Find the inverse document frequencies:

import numpy as np

bow_log_idf = np.log(1 / bow_df)
bow_log_idf
a             1.386294
adding        1.386294
already       1.386294
and           1.386294
be            1.386294
brother       1.386294
can           0.693147
cannot        1.386294
darkest       1.386294
darkness      0.287682
deeper        1.386294
devoid        1.386294
do            1.386294
drive         1.386294
for           0.693147
found         1.386294
friend        1.386294
happiness     1.386294
has           1.386294
hate          0.693147
hello         1.386294
his           1.386294
if            1.386294
in            0.693147
is            1.386294
light         0.693147
multiplies    1.386294
my            1.386294
night         1.386294
of            0.693147
old           1.386294
on            1.386294
one           1.386294
only          0.693147
out           1.386294
remembers     1.386294
returning     1.386294
stars         1.386294
that          1.386294
the           0.693147
times         1.386294
to            0.693147
turn          1.386294
walks         1.386294
whoever       1.386294
dtype: float64

More than just the inverse!

Notice we are using \(\log(\frac{1}{p_i})\), where \(p_i\) is the fraction of documents containing word \(i\), to get the IDFs. The log dampens the weights: a word appearing in 1 of 4 documents gets weight \(\log 4 \approx 1.39\) rather than 4.

IDF - Step 4

Adjust for the inverse document frequencies:

bow_tf[["darkness", "light", "hate"]]
   darkness     light      hate
0  0.133333  0.000000  0.066667
1  0.200000  0.000000  0.000000
2  0.115385  0.038462  0.115385
3  0.000000  0.055556  0.000000
bow_log_idf[["darkness", "light", "hate"]]
darkness    0.287682
light       0.693147
hate        0.693147
dtype: float64
bow_tf_idf = bow_tf.multiply(bow_log_idf, axis = 1)


bow_tf_idf[["darkness", "light", "hate"]]
   darkness     light      hate
0  0.038358  0.000000  0.046210
1  0.057536  0.000000  0.000000
2  0.033194  0.026660  0.079979
3  0.000000  0.038508  0.000000

TF-IDF Distances

dists = cosine_distances(bow_tf_idf).round(decimals = 2)
dists[0].argsort()
array([0, 3, 2, 1])

“Whoever has hate for his brother is in the darkness and walks in the darkness.”

“Hello darkness, my old friend, I’ve come to talk with you again.”

“Returning hate for hate multiplies hate, adding deeper darkness to a night already devoid of stars. Darkness cannot drive out darkness; only light can do that.”

“Happiness can be found in the darkest of times, if only one remembers to turn on the light.”

TF-IDF in sklearn

Specify

from sklearn.feature_extraction.text import TfidfVectorizer

# These options ensure that the numbers match our example above
vec = TfidfVectorizer(smooth_idf = False)

Fit

vec.fit(documents)
TfidfVectorizer(smooth_idf=False)

Transform

tfidf_matrix = vec.transform(documents)
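
From here we can compute cosine distances on the TF-IDF matrix, just as we did by hand. A sketch: the exact values may differ slightly from the manual version, since scikit-learn uses its own idf formula and row normalization, but the relative ordering should tell the same story.

from sklearn.metrics.pairwise import cosine_distances

dists = cosine_distances(tfidf_matrix)
dists[0].argsort()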

Lecture Activity 4.2 - Part 2

Activity

Using unigrams, bigrams, and trigrams, which Dr. Seuss document is closest to “One Fish Two Fish”?

Takeaways

Takeaways

  • We represent text data as a bag-of-words or bag-of-n-grams matrix.

  • Each row is a document in the corpus.

  • We typically use cosine distance to measure similarity, because it captures patterns of word choice.

  • We apply TF-IDF transformations to scale the bag-of-words data, so that words appearing in fewer documents are weighted more heavily.