API Reference
Spammy Classes
spammy.classifier module
A from-scratch implementation of the Naive Bayes algorithm.
This implementation caters to the case where a category is not observed in the dataset, for which the model would automatically assign a probability of 0.
Smoothing techniques exist for this problem, but they are not covered here.
- class spammy.classifier.NaiveBayesClassifier
  Bases: object
  Inherits from the ‘object’ class; nothing special.
- classify(features)
  The actual interface for the class. Classifies documents when called from the terminal.
  Parameters:
  - self – class object
  - features – the features of the document passed
  Returns: spam or ham
  Return type: str
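To illustrate how a Naive Bayes classifier can choose between spam and ham, here is a minimal sketch. It is not the library's code: the standalone function, the `priors` argument, and the toy likelihood values are all assumptions made for the example.

```python
def classify(features, priors, document_probability):
    """Return the label with the highest prior * likelihood score.

    priors: dict mapping label -> P(label) (hypothetical argument)
    document_probability: callable (features, label) -> P(features | label)
    """
    scores = {
        label: prior * document_probability(features, label)
        for label, prior in priors.items()
    }
    # Pick the label whose score is largest
    return max(scores, key=scores.get)


# Toy likelihoods, made up for illustration only
def toy_likelihood(features, label):
    return 0.2 if label == "spam" else 0.05


print(classify(["free", "win"], {"spam": 0.5, "ham": 0.5}, toy_likelihood))
```

With equal priors the label with the larger likelihood wins, so the toy call above selects spam.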
- document_probability(features, label)
  Finds the probability of the document by looping over its features and calling feature_probability() on each.
  Parameters:
  - self – class object
  - features – list of features
  - label – label whose probability needs to be computed
  Returns: the probability of the document being in a particular class
  Return type: float/int
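The loop described for document_probability() can be sketched as a plain product of per-feature probabilities, which is the usual Naive Bayes independence assumption. This is an illustrative standalone function, not the method itself; the `feature_probability` callable argument and the toy probability table are hypothetical.

```python
def document_probability(features, label, feature_probability):
    """Multiply the per-feature probabilities for one label."""
    probability = 1.0
    for feature in features:
        probability *= feature_probability(feature, label)
    return probability


# Toy probability table, invented for this example
toy_table = {("free", "spam"): 0.5, ("win", "spam"): 0.25}


def toy_feature_probability(feature, label):
    # Fall back to 0.5 for pairs that are not in the toy table
    return toy_table.get((feature, label), 0.5)


print(document_probability(["free", "win"], "spam", toy_feature_probability))
```

Note that a single feature with probability 0 would drive the whole product to 0, which is why the handling of unseen features matters.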
- feature_probability(feature, label)
  Calculates the probability that a feature belongs to a particular label (i.e. the class ‘spam’ or ‘ham’ here).
  Note: for an unseen feature, an arbitrary probability can be assigned, say 0.5.
  Parameters:
  - self – class object
  - feature – the feature whose probability is calculated
  - label – spam or ham
  Returns: the probability of the feature being in the label
  Return type: float
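A minimal sketch of the idea in the note, under stated assumptions: the count tables `feature_counts` and `label_totals` are hypothetical, and "unseen" is taken to mean that the feature has a count of zero for the given label. The library may define it differently.

```python
def feature_probability(feature_counts, label_totals, feature, label, unseen=0.5):
    """Estimate P(feature | label) from raw counts.

    feature_counts: dict label -> {feature: count}  (hypothetical layout)
    label_totals:   dict label -> total number of feature occurrences
    unseen:         fallback probability for features with no count
    """
    total = label_totals.get(label, 0)
    count = feature_counts.get(label, {}).get(feature, 0)
    if total == 0 or count == 0:
        # Avoid returning 0, which would zero out the document probability
        return unseen
    return count / total


counts = {"spam": {"free": 3, "win": 1}, "ham": {"meeting": 4}}
totals = {"spam": 4, "ham": 4}
print(feature_probability(counts, totals, "free", "spam"))
print(feature_probability(counts, totals, "meeting", "spam"))
```

The first call uses the observed counts; the second falls back to the `unseen` value because "meeting" never occurred under spam in the toy data.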
- train(featurelist, label)
  Trains the classifier.
  Emulates the API that NLTK provides through nltk.NaiveBayesClassifier.train().
  Note: defaultdict is used because accessing a key that is not in a plain dictionary raises a KeyError, whereas a defaultdict returns a default value when the key is not found. For more on defaultdict, refer to: http://stackoverflow.com/a/5900634/3834059
  Parameters:
  - self – class object
  - featurelist – the list of features
  - label – class of the features
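The defaultdict behaviour described in the note can be seen in a small counting sketch. The module-level tables `feature_counts` and `label_counts` are hypothetical names chosen for this example; they are not taken from the library.

```python
from collections import defaultdict

# Hypothetical count tables: missing keys start at 0 instead of raising KeyError
feature_counts = defaultdict(lambda: defaultdict(int))
label_counts = defaultdict(int)


def train(featurelist, label):
    """Record one training document's features under the given label."""
    label_counts[label] += 1
    for feature in featurelist:
        feature_counts[label][feature] += 1


train(["free", "win", "free"], "spam")
print(feature_counts["spam"]["free"])
print(feature_counts["ham"]["free"])  # never trained, still no KeyError

# A plain dict raises KeyError for the same kind of lookup
try:
    {}["free"]
except KeyError:
    print("plain dict raised KeyError")
```

One side effect worth knowing: reading a missing key from a defaultdict inserts that key with its default value.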
spammy.exceptions module
- exception spammy.exceptions.CorpusFileError
  Bases: spammy.exceptions.SpammyError
  Raised when one of the corpus files passed to the Spammy constructor does not exist, or when no file is passed to the constructor for initialization.
- exception spammy.exceptions.LimitError
  Bases: spammy.exceptions.SpammyError
  Raised when the limit passed is less than 0.
- spammy.exceptions.SpammyException
  alias of SpammyError
spammy.train module
Trainer class for the classifier
- class spammy.train.Trainer(directory, spam, ham, limit)
  Bases: object
  The trainer class.
- extract_features(text)
  Converts the document into tokens and extracts the features.
  Note: these are some possible features that would mark an email as spam. Features looked for:
  - attachments
  - links in text
  - CAPSLOCK words
  - numbers
  - words in text
  Parameters:
  - self – Trainer object
  - text – email text from which the features are extracted
  Returns: a list containing the feature set
  Return type: list
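As a rough illustration of the feature types listed above, the following sketch tokenizes a text and emits marker features for links, CAPSLOCK words, and numbers alongside the lowercased words. It is only an assumption about how such extraction might look: the marker names and regular expressions are invented here, and attachment detection is left out.

```python
import re

# Hypothetical pattern for links; real emails may need something stricter
LINK_RE = re.compile(r"https?://\S+")


def extract_features(text):
    """Return a list of features found in an email text."""
    features = []
    if LINK_RE.search(text):
        features.append("HAS_LINK")
    # Remove links so their parts are not counted as words
    text_without_links = LINK_RE.sub(" ", text)
    for token in re.findall(r"[A-Za-z]+|\d+", text_without_links):
        if token.isdigit():
            features.append("NUMBER")
            continue
        if len(token) > 1 and token.isupper():
            features.append("CAPSLOCK")
        features.append(token.lower())
    return features


print(extract_features("WIN 1000 dollars at http://example.com now"))
```

Marker features such as "CAPSLOCK" let the classifier count a style signal separately from the word itself.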
spammy.version module
Module contents
- class spammy.Spammy(directory=None, limit=None, **kwargs)
  Bases: object
  Stitches everything from the train module and the classifier module together.
- accuracy(**kwargs)
  Checks the accuracy of the classifier by running it against a testing corpus.
  Parameters:
  - limit – number of files the classifier should be tested on
  - label – the label, i.e. spam or ham
  - directory – the absolute path of the directory to be tested
  Returns: the fraction of tested files classified correctly, e.g. 0.87
  Return type: float
  Example:
  >>> from spammy import Spammy
  >>> directory = '/home/tasdik/Dropbox/projects/spammy/examples/training_dataset'
  >>> cl = Spammy(directory, limit=300)  # training on only 300 spam and ham files
  >>> cl.train()
  >>> cl.accuracy(directory='/home/tasdik/Dropbox/projects/spammy/examples/test_dataset', label='spam', limit=300)
  0.9554794520547946
  >>> cl.accuracy(directory='/home/tasdik/Dropbox/projects/spammy/examples/test_dataset', label='ham', limit=300)
  0.9033333333333333
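One plausible reading of the returned number is the share of test documents with a known label that the classifier tags with that label. The sketch below shows that calculation only; the standalone `accuracy` function, its arguments, and the toy classifier are hypothetical and are not the library's implementation.

```python
def accuracy(classify, documents, label):
    """Fraction of documents (all known to be `label`) classified as `label`."""
    if not documents:
        raise ValueError("need at least one document to measure accuracy")
    correct = sum(1 for document in documents if classify(document) == label)
    return correct / len(documents)


# Toy classifier, made up for the example: anything mentioning "free" is spam
def toy_classify(document):
    return "spam" if "free" in document else "ham"


spam_documents = ["free money", "hello", "free beer", "free"]
print(accuracy(toy_classify, spam_documents, "spam"))
```

Three of the four toy documents contain "free", so the toy classifier gets three of four right.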
- classify(email_text)
  Tries to classify text as spam or ham.
  Parameters: email_text – the email text to be classified
  Returns: either ham or spam
  Return type: str
  Note: run this after you have trained the classifier object on your dataset.
  Example:
  >>> from spammy import Spammy
  >>> cl = Spammy(path_to_training_data, limit=200)  # 200 or the number of files you need to train the classifier upon
  >>> cl.train()
  >>> HAM_TEXT = '''
  ... Bro. Hope you are fine.
  ... Hows the work going on ? Can you send me some updates on it.
  ... And are you free tomorrow ?
  ... No problem man. But please make sure you are finishing it
  ... by friday night and sending me on on that day itself. As we
  ... have to get it printed on Saturday.
  ... '''
  >>> cl.classify(HAM_TEXT)
  'ham'