Path: mv.asterisco.pt!mvalente
From: mvale…@ruido-visual.pt (Mario Valente)
Newsgroups: mv
Subject: Classifier+Tagger=ClassiTagger
Date: Sat, 02 Jun 07 21:47:21 GMT
Imagine that I have a /home dir full of documents of
several different types (DOC, PDF, PPT, TXT, etc.) and,
although I have some specific folders for specific stuff,
most of these documents still need to be classified into
specific folders (or tagged, if you use tags instead).
Although I have scoured the Internet for some utility
or app that does this for me automatically (it's *boooring*
going through each file, opening it, analyzing the content
and deciding which folder it should go in), I have found
nothing of the kind. No, I don't want some app that indexes
the contents and lets me search: I want an app that
looks through the files and moves them to specific folders
(created on demand by the app itself), or just tags them
according to their content.
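A minimal sketch of the "move each file into a folder created on demand" behaviour described above, assuming Python 3 and plain-text input only. The names pick_tag, file_into_folder and STOPWORDS are hypothetical, and choosing the single most frequent non-stopword as the folder name is just the simplest rule that fits, not a real classifier.

```python
import os
import re
import shutil
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def pick_tag(text, stopwords=STOPWORDS):
    """Return the most frequent non-stopword in text, or 'untagged'."""
    # \w+ keeps letters, digits and underscores, so a tag can never
    # contain a path separator.
    words = re.findall(r"\w+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    if not counts:
        return "untagged"
    return counts.most_common(1)[0][0]

def file_into_folder(path, root):
    """Move a text file into root/<tag>/, creating the folder on demand."""
    with open(path, "r", errors="ignore") as handle:
        tag = pick_tag(handle.read())
    target_dir = os.path.join(root, tag)
    os.makedirs(target_dir, exist_ok=True)
    destination = os.path.join(target_dir, os.path.basename(path))
    shutil.move(path, destination)
    return destination
```

Calling file_into_folder("notes.txt", "/home/me/sorted") would move the file into a subfolder named after its most common word. DOC, PDF and PPT files would first need a text-extraction step, which this sketch leaves out.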
Does anyone know of something like that? Suggestions to
mvalente@ruido-visual.pt and/or mfvalente@gmail.com.
-- MV
PS - I've even gotten so desperate as to start hacking a
script in Python to do it for me; the code for the
ClassiTagger follows below.
88
from operator import itemgetter
MINFREQUENCY=5
MAXNGRAMTAGS=10
filename='texto.txt'
# Common words that should never be reported as tags
stopwords={}
liststopwords=('the','a','an','and','or','not','if','then','else','i','you','he','she','we','them','us','to',
               'your','of','off','is','in','on','for','that','this','can','have','are','it','be','at',
               'with','will','use','do','see','as','which','from','by','should','into','some','these',
               'when','what','but','other','may','all','has','my','out','make','sure','like','get',
               'so','one','how','after','before','*','+','about','any','look','no','yes',
               'where','who','there','here','same','dont','more','than','also','up','down','must','yet','many','why',
               'was','his','her',
               "don't","doesn't","you'll","it's")
for word in liststopwords:
    stopwords[word]=1
print("CLASSITAGGER")
print("Classifying/Tagging file %s\n" % filename)
# Read file
with open(filename, 'r') as inFile:
    content = inFile.read()
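# Split by words
words = content.split()
# Extract N-grams; each tag is a tuple of 1, 2 or 3 lowercased words
# (the seeding below assumes the file has at least two words)
tags={}
tags[(words[0].lower(),)]=1
tags[(words[1].lower(),)]=1
tags[(words[0].lower(),words[1].lower())]=1
i=2
# Count the 1-, 2- and 3-grams ending at word i (loop body reconstructed
# from the seeding above; the posted listing is cut off at this point)
while i < len(words):
    w0, w1, w2 = words[i-2].lower(), words[i-1].lower(), words[i].lower()
    for ngram in ((w2,), (w1,w2), (w0,w1,w2)):
        tags[ngram] = tags.get(ngram, 0) + 1
    i=i+1
# Report the most frequent 3-grams that contain no stopwords
maxngramtags=0
for ngram,count in sorted(tags.items(), key=itemgetter(1), reverse=True):
    if len(ngram)==3 and count>=MINFREQUENCY and not (ngram[0] in stopwords or ngram[1] in stopwords or ngram[2] in stopwords):
        print("%d %s %d" % (len(ngram), ngram, count))
        maxngramtags=maxngramtags+1
        if maxngramtags==MAXNGRAMTAGS: break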
# Report the most frequent 2-grams that contain no stopwords
maxngramtags=0
for ngram,count in sorted(tags.items(), key=itemgetter(1), reverse=True):
    if len(ngram)==2 and count>=MINFREQUENCY and not (ngram[0] in stopwords or ngram[1] in stopwords):
        print("%d %s %d" % (len(ngram), ngram, count))
        maxngramtags=maxngramtags+1
        if maxngramtags==MAXNGRAMTAGS: break
# Report the most frequent single words that are not stopwords
maxngramtags=0
for ngram,count in sorted(tags.items(), key=itemgetter(1), reverse=True):
    if len(ngram)==1 and count>=MINFREQUENCY and not ngram[0] in stopwords:
        print("%d %s %d" % (len(ngram), ngram, count))
        maxngramtags=maxngramtags+1
        if maxngramtags==MAXNGRAMTAGS: break
88
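The script above repeats the same report loop three times, once per n-gram length. As a hedged alternative, the same counting and filtering could be packed into two small functions built on collections.Counter. This is a Python 3 sketch under stated assumptions, not the ClassiTagger itself: count_ngrams and top_tags are hypothetical names, and the defaults simply mirror MINFREQUENCY=5 and MAXNGRAMTAGS=10.

```python
from collections import Counter

def count_ngrams(words, max_n=3):
    """Count every run of 1..max_n consecutive words, keyed by tuple."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            counts[tuple(words[start:start + n])] += 1
    return counts

def top_tags(text, stopwords=(), min_frequency=5, max_tags=10, max_n=3):
    """Return up to max_tags (ngram, count) pairs per length, longest first."""
    words = [w.lower() for w in text.split()]
    counts = count_ngrams(words, max_n)
    result = []
    for n in range(max_n, 0, -1):
        candidates = [
            (ngram, count)
            for ngram, count in counts.most_common()
            if len(ngram) == n
            and count >= min_frequency
            # Drop any n-gram that contains a stopword, as the script does.
            and not any(word in stopwords for word in ngram)
        ]
        result.extend(candidates[:max_tags])
    return result
```

For example, top_tags(open('texto.txt').read(), stopwords=liststopwords) would return the tags as a list instead of printing them, which makes it easier to feed them into whatever later moves or labels the file. Unlike the script, this version also copes with files shorter than two words.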