API Reference¶

Spammy Classes¶

spammy.classifier module¶

A hand-rolled implementation of the Naive Bayes algorithm.

This implementation caters to the case where a category is not observed in the dataset, in which the model would otherwise automatically assign it a probability of 0.

Smoothing techniques are a related topic, but they are not covered here.

class spammy.classifier.NaiveBayesClassifier[source]¶

Bases: object

Inherits directly from the ‘object’ class; nothing special.

classify(features)[source]¶

The actual interface of the class. It classifies a document when called from the terminal.

Parameters:

self – class object
features – the features of the document passed

Returns:	spam or ham
Return type:	str

document_probability(features, label)[source]¶

Computes the probability of a document by looping over its features and calling feature_probability() for each one.

Parameters:

self – class object
features – list of features
label – label for which the probability is calculated

Returns:	the probability of the document being in a particular class
Return type:	float/int
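
The loop described above can be sketched as follows. This is a minimal illustration, not the library's code: it assumes the document probability is the plain product of the per-feature probabilities (the naive independence assumption), and the FEATURE_PROBS table, the UNSEEN_PROBABILITY constant and the numbers in them are made up for the example. Whether the real method also multiplies in a class prior is not stated here.

```python
# Hypothetical per-label feature probabilities, for illustration only.
FEATURE_PROBS = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.2, "meeting": 0.7},
}
UNSEEN_PROBABILITY = 0.5  # assumed default for features missing from the table


def feature_probability(feature, label):
    """Look up P(feature | label), falling back to the assumed default."""
    return FEATURE_PROBS[label].get(feature, UNSEEN_PROBABILITY)


def document_probability(features, label):
    """Multiply the per-feature probabilities for one label."""
    probability = 1.0
    for feature in features:
        probability *= feature_probability(feature, label)
    return probability


# Score the same document under both labels and keep the higher one.
features = ["free", "meeting"]
scores = {label: document_probability(features, label) for label in FEATURE_PROBS}
best_label = max(scores, key=scores.get)
```

Comparing the scores of the two labels in this way is one plausible way a classify() method could pick between spam and ham.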

feature_probability(feature, label)[source]¶

Calculates the probability that a feature belongs to a particular label (i.e. the class ‘spam’ or ‘ham’ in our case).

Note

For an unseen feature, an arbitrary probability can be assigned, say 0.5.

Parameters:

self – class object
feature – the feature for which the probability is calculated
label – spam or ham

Returns:	the probability of the feature being in the label
Return type:	float
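
One way to picture this calculation is the count-based sketch below. It is an assumption, not the library's implementation: the probability is taken as the fraction of training documents with the given label that contain the feature, and the module-level feature_counts and label_counts tables are hypothetical names. The 0.5 default for features never seen under any label follows the note above.

```python
from collections import Counter, defaultdict

# feature_counts[label][feature]: documents with this label containing the feature
feature_counts = defaultdict(Counter)
# label_counts[label]: number of training documents seen for this label
label_counts = Counter()


def train(featurelist, label):
    """Record one training document, counting each feature once."""
    label_counts[label] += 1
    for feature in set(featurelist):
        feature_counts[label][feature] += 1


def feature_probability(feature, label, unseen=0.5):
    """Estimate P(feature | label) from the counts collected by train()."""
    seen = any(feature in counts for counts in feature_counts.values())
    if not seen:
        return unseen  # never observed under any label
    if label_counts[label] == 0:
        return 0.0  # no training documents for this label
    return feature_counts[label][feature] / label_counts[label]


for document in (["free", "win"], ["free"], ["free", "click"], ["click"]):
    train(document, "spam")
train(["meeting"], "ham")
```

With this toy data, "free" gets 0.75 under spam and 0.0 under ham (the zero-probability case mentioned in the module description), while a word that was never seen at all gets 0.5.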

train(featurelist, label)[source]¶

Trains the classifier.

Tries to emulate the API that NLTK provides through nltk.NaiveBayesClassifier.train().

Note

defaultdict is used because accessing a key that is not in a plain dictionary raises a KeyError, whereas a defaultdict returns a default value if the key is not found.

For more on defaultdict, refer to: http://stackoverflow.com/a/5900634/3834059
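
The difference described in the note can be seen in a few lines. The variable names below are only for illustration.

```python
from collections import defaultdict

# A plain dict raises KeyError when a missing key is read.
plain_counts = {}
missing_key_raised = False
try:
    plain_counts["spam"] += 1
except KeyError:
    missing_key_raised = True

# defaultdict(int) creates missing keys on first access with int() == 0,
# so counters can be incremented without an explicit existence check.
counts = defaultdict(int)
counts["spam"] += 1
ham_count = counts["ham"]  # reading a missing key also inserts it with value 0
```

Note that merely reading a missing key from a defaultdict adds it to the dictionary, which matters if the keys are later iterated over.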

Parameters:

self – class object
featurelist – the list of features
label – class of the features

spammy.exceptions module¶

exception spammy.exceptions.CorpusFileError[source]¶

Bases: spammy.exceptions.SpammyError

Raised when one of the corpus files passed to the Spammy constructor does not exist, or when no file is passed to the constructor for initialization.

exception spammy.exceptions.LimitError[source]¶

Bases: spammy.exceptions.SpammyError

Raised when the limit passed is less than 0.

exception spammy.exceptions.SpammyError[source]¶

Bases: exceptions.Exception

A Spammy-related error.

spammy.exceptions.SpammyException¶: alias of SpammyError
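
Because CorpusFileError and LimitError both derive from SpammyError, a caller can catch the base class to handle either one. The sketch below uses local stand-in classes that mirror the hierarchy listed above so that it runs without the package installed; check_arguments() and describe() are hypothetical helpers for the example, not part of spammy, and the checks inside them only approximate the conditions described above.

```python
import os


class SpammyError(Exception):
    """Stand-in for spammy.exceptions.SpammyError."""


class CorpusFileError(SpammyError):
    """Stand-in: the corpus is missing or was not given."""


class LimitError(SpammyError):
    """Stand-in: the limit is less than 0."""


def check_arguments(directory, limit):
    """Hypothetical validation helper, not part of spammy."""
    if directory is None or not os.path.exists(directory):
        raise CorpusFileError("corpus not found: {!r}".format(directory))
    if limit is not None and limit < 0:
        raise LimitError("limit must not be negative: {!r}".format(limit))


def describe(directory, limit):
    """Catch the base class and report which subclass was raised."""
    try:
        check_arguments(directory, limit)
    except SpammyError as error:
        return type(error).__name__
    return "ok"
```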

spammy.train module¶

Trainer class for the classifier

class spammy.train.Trainer(directory, spam, ham, limit)[source]¶

Bases: object

The trainer class

extract_features(text)[source]¶

Converts the document into tokens and extracts the features.

Note

These are some possible features that could mark an email as spam.

Features looked for:

- Attachments
- Links in text
- CAPSLOCK words
- Numbers
- Words in text

Parameters:

self – Trainer object
text – email text from which the features are extracted

Returns:	a list which contains the feature set
Return type:	list
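
A rough sketch of how such features could be pulled out of plain text is given below. Everything in it is an assumption made for illustration: the regular expressions, the marker names (HAS_LINK, HAS_NUMBER, HAS_CAPS_WORD) and the tokenization are not taken from the library, and attachments are left out because they cannot be detected from the text alone.

```python
import re

URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[A-Za-z0-9']+")


def extract_features(text):
    """Return a list of marker features followed by the lowercased words."""
    features = []
    if URL_RE.search(text):
        features.append("HAS_LINK")
    # Remove links before tokenizing so URL fragments do not count as words.
    tokens = TOKEN_RE.findall(URL_RE.sub(" ", text))
    if any(token.isdigit() for token in tokens):
        features.append("HAS_NUMBER")
    if any(len(token) > 1 and token.isupper() for token in tokens):
        features.append("HAS_CAPS_WORD")
    # Every non-numeric token is also kept as a word feature.
    features.extend(token.lower() for token in tokens if not token.isdigit())
    return features


sample = extract_features("WIN 1000 dollars now http://example.com/prize")
```

For the sample text this yields the three markers followed by the words "win", "dollars" and "now".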

train()[source]¶

Starts the training process on the directories passed by the user

Parameters:	self – Trainer object

train_classifier(path, label)[source]¶

The function that does the actual training work on the data directory for the given label.

Parameters:

self – Trainer object
path – the path of the data directory
label – the label under which the data directory falls

spammy.version module¶

Module contents¶

class spammy.Spammy(directory=None, limit=None, **kwargs)[source]¶

Bases: object

Stitches everything from the train module and the classifier module together.

accuracy(**kwargs)[source]¶

Checks the accuracy of the classifier by running it against a testing corpus

Parameters:

limit – number of files the classifier should be tested on
label – the label, i.e. spam or ham
directory – the absolute path of the directory to be tested

Returns:	the accuracy of the classifier, e.g. 0.87
Return type:	float
Example:

>>> from spammy import Spammy
>>> directory = '/home/tasdik/Dropbox/projects/spammy/examples/training_dataset'
>>> cl = Spammy(directory, limit=300)  # training on only 300 spam and ham files
>>> cl.train()
>>> cl.accuracy(directory='/home/tasdik/Dropbox/projects/spammy/examples/test_dataset', label='spam', limit=300)
0.9554794520547946
>>> cl.accuracy(directory='/home/tasdik/Dropbox/projects/spammy/examples/test_dataset', label='ham', limit=300)
0.9033333333333333
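
The number returned by a per-label check like this can be thought of as the fraction of test documents that receive the expected label. The sketch below shows that calculation on its own terms; it is an assumption about what is being measured, not the library's code, and toy_classify() is a made-up stand-in for a trained classifier.

```python
def accuracy(classify, documents, label):
    """Return the fraction of documents that classify() assigns to label."""
    if not documents:
        raise ValueError("need at least one document to test on")
    correct = sum(1 for document in documents if classify(document) == label)
    return correct / len(documents)


def toy_classify(text):
    """Toy stand-in classifier: anything mentioning 'free' is spam."""
    return "spam" if "free" in text.lower() else "ham"


spam_documents = ["FREE money", "free prize inside", "lunch tomorrow?", "Free entry"]
spam_accuracy = accuracy(toy_classify, spam_documents, "spam")
```

Three of the four toy documents are labelled spam here, so spam_accuracy comes out as 0.75.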

classify(email_text)[source]¶

Tries to classify the text as spam or ham.

Parameters:	email_text – the email text to be classified
Returns:	Either ham or spam
Return type:	str

Note

To be run after you have trained the classifier object on your dataset

Example:

>>> from spammy import Spammy
>>> # limit: the number of files to train the classifier on (200 here)
>>> cl = Spammy(path_to_training_data, limit=200)
>>> cl.train()
>>> HAM_TEXT = '''
... Bro. Hope you are fine. Hows the work going on ? Can you send me some updates on it.
... And are you free tomorrow ?
... No problem man. But please make sure you are finishing it
... by friday night and sending me on on that day itself. As we
... have to get it printed on Saturday.
... '''
>>> cl.classify(HAM_TEXT)
'ham'

train()[source]¶

Trains the classifier object

Parameters:

self – the classifier object

Example:

>>> from spammy import Spammy
>>> directory = '/home/tasdik/Dropbox/projects/spammy/examples/training_dataset'
>>> cl = Spammy(directory, limit=300)  # training on only 300 spam and ham files
>>> cl.train()