Path: mv.asterisco.pt!mvalente
From: mvale…@ruido-visual.pt (Mario Valente)
Newsgroups: mv
Subject: Classifier+Tagger=ClassiTagger
Date: Sat, 02 Jun 07 21:47:21 GMT
  Imagine that I have a /home dir full of documents of
 several different types (DOC, PDF, PPT, TXT, etc.) and,
 although I have some specific folders for specific stuff,
 most of these documents still need to be classified into
 specific folders (or tagged, if you use tags instead).
  Although I have scoured the Internet for some utility
 or app that does this for me automatically (it's *boooring*
 to go through each file, open it, analyze its content and
 decide which folder it should go in), I have found
 nothing of the kind. No, I don't want some app that indexes
 the contents and lets me search: I want an app that
 looks through the files and moves them to specific folders
 (created on demand by the app itself), or just tags them
 according to their content.
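  The "moves them to specific folders (created on demand)" step
 might look roughly like the Python 3 sketch below. The function
 name move_to_tag_folder, the use of the tag as the folder name and
 the refuse-to-overwrite rule are all assumptions made for
 illustration; deciding which tag a file deserves is the hard part
 and is not shown here.

```python
import shutil
from pathlib import Path

def move_to_tag_folder(path, tag, root):
    """Move the file at `path` into root/tag, creating the folder if needed."""
    folder = Path(root) / tag
    folder.mkdir(parents=True, exist_ok=True)  # folder is created on demand
    target = folder / Path(path).name
    if target.exists():
        # Assumption: never overwrite; leave name clashes for a human to sort out.
        raise FileExistsError(str(target))
    shutil.move(str(path), str(target))
    return target
```

  For example, move_to_tag_folder("report.pdf", "invoices", "sorted")
 would leave the file at sorted/invoices/report.pdf.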
  Does anyone know of something like that? Send suggestions to
 mvalente@ruido-visual.pt and/or mfvalente@gmail.com.
  — MV
PS - I've even gotten desperate enough to start hacking a
 script in Python to do it for me; the code for the
 ClassiTagger follows below.
88
from operator import itemgetter
MINFREQUENCY=5
MAXNGRAMTAGS=10
filename='texto.txt'
stopwords={}
liststopwords=('the','a','an','and','or','not','if','then','else','i','you','he','she','we','them','us','to',
               'your','of','off','is','in','on','for','that','this','can','have','are','it','be','at',
               'with','will','use','do','see','as','which','from','by','should','into','some','these',
               'when','what','but','other','may','all','has','my','out','make','sure','like','get',
               'so','one','how','after','before','*','+','about','any','look','no','yes',
               'where','who','there','here','same','dont','more','than','also','up','down','must','yet','many','why',
               'was','his','her',
               "don't","doesn't","you'll","it's")
for word in liststopwords:
  stopwords[word]=1
print("CLASSITAGGER")
print("Classifying/Tagging file %s \n" % filename)
# Read file
inFile = open(filename, 'r')
content = inFile.read()
inFile.close()
#Split by words
words = content.split()
#Extract N-grams
tags={}
tags[(words[0].lower(),)]=1
tags[(words[1].lower(),)]=1
tags[(words[0].lower(),words[1].lower())]=1
i=2
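while i < len(words):
  # NOTE: the body of this loop was lost when the code was posted; it is
  # rebuilt from context: count every 1-, 2- and 3-gram ending at word i
  w0,w1,w2 = words[i-2].lower(),words[i-1].lower(),words[i].lower()
  for ngram in ((w2,),(w1,w2),(w0,w1,w2)):
    tags[ngram]=tags.get(ngram,0)+1
  i=i+1
#Print the most frequent N-grams, longest first
maxngramtags=0
for ngram,count in sorted(tags.items(), key=itemgetter(1), reverse=True):
  if len(ngram)==3 and count>=MINFREQUENCY and not (ngram[0] in stopwords or ngram[1] in stopwords or ngram[2] in stopwords):
    print("%d %s %d" % (len(ngram), ngram, count))
    maxngramtags=maxngramtags+1
    if maxngramtags==MAXNGRAMTAGS: break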
maxngramtags=0
for ngram,count in sorted(tags.items(), key=itemgetter(1), reverse=True):
  if len(ngram)==2 and count>=MINFREQUENCY and not (ngram[0] in stopwords or ngram[1] in stopwords):
    print("%d %s %d" % (len(ngram), ngram, count))
    maxngramtags=maxngramtags+1
    if maxngramtags==MAXNGRAMTAGS: break
maxngramtags=0
for ngram,count in sorted(tags.items(), key=itemgetter(1), reverse=True):
  if len(ngram)==1 and count>=MINFREQUENCY and not (ngram[0] in stopwords):
    print("%d %s %d" % (len(ngram), ngram, count))
    maxngramtags=maxngramtags+1
    if maxngramtags==MAXNGRAMTAGS: break
88
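The three near-identical reporting loops in the script could probably be folded into one helper. What follows is only a sketch for Python 3, not a drop-in replacement: the name top_ngrams and its parameters are made up for illustration, and it swaps the hand-rolled tags dictionary for collections.Counter from the standard library.

```python
from collections import Counter

def top_ngrams(text, stopwords, n, min_frequency=5, max_tags=10):
    """Return up to max_tags (ngram, count) pairs for n-grams of length n."""
    words = text.lower().split()
    # Slide a window of n words over the text and count each tuple.
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    result = []
    for ngram, count in counts.most_common():
        if count < min_frequency:
            break  # most_common() is sorted by count, so the rest are rarer
        if any(word in stopwords for word in ngram):
            continue  # skip n-grams that contain a stopword
        result.append((ngram, count))
        if len(result) == max_tags:
            break
    return result

if __name__ == "__main__":
    sample = "big data " * 6 + "the end"
    for n in (3, 2, 1):
        for ngram, count in top_ngrams(sample, {"the"}, n):
            print(n, ngram, count)
```

Calling it once each for n=3, n=2 and n=1 reproduces the longest-first report of the script. Lowercasing the whole text up front is a design choice that keeps the stopword lookup simple.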