Email Based Knowledge Management
Hi Zak,
[Disclaimer: I wrote like 90% of this on the plane back and then lost it in
a drafts folder. Sorry about the delay. Hope it's still useful. Let me
know if I can help at all. ]
-- Scott
Hi there,
It was nice to meet you at PHP Con, in the bar with Jeremy on the last
night.  You had mentioned building a classification system and I tossed
out a few throwaway ideas.  This email is the promised clarification of
said ideas.
So here's the problem that you are trying to solve (as I understand it
from the few comments you made):
a) You have a distributed group of largely technical folk who have
information flowing between them
b) Information is in different forms but is largely text data
c) Search in and of itself probably isn't sufficient to solve the
information access and organization problem -- classification is needed.
You made an extremely accurate remark, just prior to commenting on
classification, on the difficulties of both mailing lists and IM -- i.e.
who to contact, watching for the response, the fear of being thought
foolish (I think this is my addition, not yours; either way). You then
made a comment about engineers contributing to a repository and
"categorizing / classifying it on entry".
My experience is that this tends to work poorly if at all for these reasons:
a) Classification is just plain hard. It's easy for the person who
creates the schema or taxonomy but, of course, much, much harder for the
person trying to use it.
(Anyone who's ever tried to understand the organization of a deep class
hierarchy like Smalltalk can sympathize with this.)
b) If you only go with 1 categorization entry it's the same as "what
folder do I put this email in".  And you correctly pointed out any
classification system would have to support multiple categories per
entry. This, of course, raises the issue of over classification.
c) A practical matter is that browsers are very, very poor at
efficiently displaying trees of information that are selectable. I've
done this with Java for classification tools (kill me now) and DHTML is
better but still not good.
d) Classifying things well takes time. And, if contributions to a
repository take time, it isn't done. Period.
Sidebar: A way to encourage repository contributions is incentivizing
the process by tying it to a person's career path. Some of the big
consulting firms have had luck with this but it's very, very hard since
that then raises the issue of rating contributions, feedback,
correctness, etc.
So with this said, is classification dead?  Is KM unworkable in a small,
distributed, busy organization?  Not at all.  Early KM systems (Dataware
II KMS, etc.) tended to be large, centralized, bulky systems that
operated in a disjoint fashion from the organization's day to day
business activities.  KM was a "special activity".  That's just plain
wrong. It's distressingly wrong (disclaimer: I was the product manager
and one of the architects of the Dataware II KMS).
Here's how I would solve this problem today:
a) Adopt email as a first class input. I'd probably be tempted to
abandon the idea of a web form as an input source entirely. Very little
knowledge today doesn't pass through email at one point or another.
b) Make contributions to the repository as simple as forwarding, cc'ing
or bcc'ing an existing message / document attachment / url to a known
address. Say km@mysql.com or "library@mysql.com".
c) Route those messages into a repository extracting from them metadata
as follows (this is a rough draft and would need more thought):
Subject ------- treat this as title
Contributor --- the sender
URLs ---------- I'd pull these into a separate table
Attachments --- I'd pull out type, size, filename into separate fields /
tables
OtherFolks ---- Any other people cc'd / authors on the message
Filenames ----- Since MySQL is a software company, being able to extract
from a message that there is a reference to queryoptimizer.c is probably
useful.
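A minimal sketch of the extraction step for the fields above, assuming
Python and its standard email package.  The function name, the dictionary
keys, the repository address default and the file-extension pattern are all
illustrative assumptions, not part of any existing system.

```python
import re
from email import message_from_string
from email.utils import getaddresses, parseaddr

URL_RE = re.compile(r"https?://[^\s<>\"')]+")
# Hypothetical pattern for source file names; tune the extension list to taste.
FILENAME_RE = re.compile(r"\b[\w-]+\.(?:c|h|cc|cpp|php|sql)\b")


def extract_metadata(raw_message, repository_address="km@mysql.com"):
    """Pull title, contributor, URLs, attachments, etc. from a raw message."""
    msg = message_from_string(raw_message)
    body_parts, attachments = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True) or b""
        filename = part.get_filename()
        if filename:
            # Attachment: record type, size and filename only.
            attachments.append({
                "filename": filename,
                "type": part.get_content_type(),
                "size": len(payload),
            })
        elif part.get_content_type() == "text/plain":
            charset = part.get_content_charset() or "utf-8"
            body_parts.append(payload.decode(charset, errors="replace"))
    body = "\n".join(body_parts)
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    return {
        "title": msg.get("Subject", ""),
        "contributor": parseaddr(msg.get("From", ""))[1],
        # Everyone else on the message, minus the repository address itself.
        "other_folks": [addr for _, addr in recipients
                        if addr.lower() != repository_address],
        "urls": URL_RE.findall(body),
        "attachments": attachments,
        "filenames": sorted(set(FILENAME_RE.findall(body))),
    }
```

Each key maps onto one of the fields listed above; the URL, attachment and
filename lists are the ones that would go into their own tables.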
(Issues:
When the contributor of the article isn't the author, who gets credit?
Depending on the complexity of signatures on the bottom of messages you
may need a "message zoning" algorithm that discards signatures.
I'd also suggest that a supporting data table that references plain text
customer names to customer urls be part of the meta data system {or a
separate query expansion / query processor }. This would recognize that
anything with a url or email address that referenced "abcsystems.com" is
equivalent to "ABC Systems". This is very, very useful since we tend to
think and speak in English, not DNS, at times.
If a customer has particular issues like "Jeremy = Large database |
Replication" then you could use this as a way to embed additional
classification tags.
)
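The customer name table mentioned above could be sketched like this in
Python.  The table contents, the regular expression and the function name
are assumptions for illustration; a real version would read the mapping
from the supporting data table.

```python
import re

# Hypothetical lookup table: DNS domain -> plain text customer name.
CUSTOMER_DOMAINS = {"abcsystems.com": "ABC Systems"}

# Host names that follow an "@" (email address) or "://" (URL).
HOST_RE = re.compile(r"(?:@|://)([A-Za-z0-9.-]+)")


def customers_mentioned(text, table=CUSTOMER_DOMAINS):
    """Return the sorted customer names whose domains appear in the text."""
    names = set()
    for host in HOST_RE.findall(text):
        host = host.lower().rstrip(".")
        for domain, name in table.items():
            # Match the domain itself or any subdomain of it.
            if host == domain or host.endswith("." + domain):
                names.add(name)
    return sorted(names)
```

The names that come back can be stored as extra classification tags or
used to expand a query at search time.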
d) Now, classification (you are probably thinking "Finally!").  There
are a couple of questions and issues to understand. The first issue is
simple: Key Words Suck Rocks. There are very few topics where a "key
word" really works well. This is particularly true in mature technical
fields where a vocabulary has grown up. Take my presentation on PHP
Login Security at this conference. Just the keyword Security is
pretty useless these days.  It's much, much more effective to generate key
phrases, using something like Porter's Stemming algorithm to match up the
variants of each word.  To see it in action, search for "MySQL" (no quotes)
on this meta search engine, which uses Porter's Stemming algorithm:
http://www.queryserver.com/web.htm (this was a product of mine once upon a
time).
If you want to automatically assign classifications then I'd recommend
implementing Porter's Stemming and mapping the keywords to the taxonomy
entries. Rather than doing it all at once I'd build a training interface
which lets it be done with human assistance for the first X time periods.
Trying to get it right up front usually doesn't work.
Porter's Stemming is a fairly well understood bit of code with
implementations in most languages. Results vary based on language and it
works best in English.  The algorithm implementation details are below.
Well, I'm probably boring you by this point.  Let me know if
you want more information.
Best
Scott
-- Porter's stemming implementation (I'm not sure what the legal issues are
with my giving you this since there have been companies sold, people laid
off, etc. If you reproduce this then please don't reference
www.queryserver.com. It should be fine but always best to be cautious).
It is based on both title and summary, with phrases from the title having a
higher weight.
Here is a full description of the algorithm, for your comments....
a) Source of Candidate Phrases
Phrases are extracted from the result's title and summary, with each
occurrence of the phrase from the former contributing a weight of 2+2*N, and
each occurrence from the latter contributing a weight of 2+N, where N is the
number of search services that returned the result.
The weight of a phrase is the sum of the weights of the occurrences of the
phrase in the results.
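As a quick worked example of the weighting above, here is a small Python
sketch.  The function names are illustrative assumptions; the arithmetic is
just the 2+2*N and 2+N rules.

```python
def occurrence_weight(in_title, num_services):
    """Weight of one occurrence; N is the number of services returning the result."""
    if in_title:
        return 2 + 2 * num_services
    return 2 + num_services


def phrase_weight(occurrences):
    """Sum the weights of (in_title, num_services) occurrences of one phrase."""
    return sum(occurrence_weight(in_title, n) for in_title, n in occurrences)
```

For a result returned by 3 services, a title occurrence contributes 8 and a
summary occurrence contributes 5.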
b) Extraction of Words
The text to extract phrases from is split into words. A word is a set of
adjacent alphanumeric characters, which can also contain a hyphen (but not
start or end with a hyphen), and which can also end with "'s" (but not end
with "-'s"). Each word is assigned a type (see below), an ownership flag
(whether it ends with "'s"), a stop word flag (see below), and a punctuation
flag (whether punctuation exists between the word and the next).
A word's type is a set of flags which indicate the type of characters
present in the word. The flags, which are ORed together, are as follows: 1
= first char is lowercase, 2 = first char is uppercase, 4 = first char is
digit; 8 = subsequent char is lowercase, 16 = subsequent char is uppercase,
32 = subsequent char is digit; 64 = word contains hyphen. The word type is
used in the detection of stop words, and in determining the best form of a
word for display purposes.
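The word extraction and type flags could be sketched in Python as follows.
The regular expression, the dictionary keys and the choice to strip "'s"
before computing the type are my assumptions; the flag values are the ones
given above.  The stop word flag is left out here since its rules come next.

```python
import re

FIRST_LOWER, FIRST_UPPER, FIRST_DIGIT = 1, 2, 4
REST_LOWER, REST_UPPER, REST_DIGIT = 8, 16, 32
HAS_HYPHEN = 64

# Alphanumeric runs joined by single interior hyphens, optionally ending in 's.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*(?:'s)?")


def word_type(word):
    """OR together the flags describing the characters in the word."""
    flags = 0
    first = word[0]
    if first.islower():
        flags |= FIRST_LOWER
    elif first.isupper():
        flags |= FIRST_UPPER
    elif first.isdigit():
        flags |= FIRST_DIGIT
    for ch in word[1:]:
        if ch.islower():
            flags |= REST_LOWER
        elif ch.isupper():
            flags |= REST_UPPER
        elif ch.isdigit():
            flags |= REST_DIGIT
        elif ch == "-":
            flags |= HAS_HYPHEN
    return flags


def extract_words(text):
    """Split text into words with type, ownership and punctuation flags."""
    matches = list(WORD_RE.finditer(text))
    words = []
    for i, match in enumerate(matches):
        token = match.group(0)
        owned = token.endswith("'s")
        core = token[:-2] if owned else token
        next_start = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        gap = text[match.end():next_start]
        words.append({
            "text": core,
            "type": word_type(core),
            "owned": owned,
            # Anything other than whitespace before the next word is punctuation.
            "punct": any(not ch.isspace() for ch in gap),
        })
    return words
```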
The stop word flag is set if the word has one of the following
characteristics:
(i) The word length is 1.
(ii) The word consists entirely of digits.
(iii) The word consists entirely of lowercase or uppercase letters (but not
a mixture), and either the word length is less than 4 or the word is in the
stop word list.
(iv) The word starts with an uppercase letter and contains only subsequent
lowercase letters, and the word is not at the beginning of a sentence, and
either the word length is less than 4 or the word is in the stop word list.
(v) The word consists entirely of lowercase or uppercase letters (but not a
mixture) and contains a hyphen, and the word is in the stop word list.
(vi) The word starts with an uppercase letter and contains only subsequent
lowercase letters and at least one hyphen, and the word is not at the
beginning of a sentence, and the word is in the stop word list.
The stop word list currently contains the following words:
"about", "above", "across", "after", "afterwards", "again", "against",
"almost", "alone", "along", "already", "also", "although", "always",
"among", "amongst", "amount", "another", "anybody", "anyhow", "anyone",
"anything", "anyway", "anywhere", "around", "became", "because", "become",
"becomes", "becoming", "been", "before", "beforehand", "behind", "below",
"beside", "besides", "between", "beyond", "both", "cannot", "could",
"couldn", "done", "down", "during", "each", "either", "else", "elsewhere",
"empty", "enough", "even", "ever", "every", "everybody", "everyone",
"everything", "everywhere", "except", "first", "former", "formerly", "from",
"full", "further", "hasn", "have", "hence", "here", "hereafter", "hereby",
"herein", "hereupon", "hers", "herself", "himself", "however", "inc",
"indeed", "into", "itself", "last", "latter", "latterly", "least", "less",
"many", "meanwhile", "more", "moreover", "most", "mostly", "much", "must",
"myself", "namely", "neither", "never", "nevertheless", "next", "nobody",
"none", "no-one", "nothing", "nowhere", "often", "only", "onto", "other",
"others", "otherwise", "ours", "ourselves", "perhaps", "please", "rather",
"same", "seem", "seemed", "seeming", "seems", "serious", "several",
"should", "shouldn", "since", "sincere", "some", "somehow", "someone",
"something", "sometime", "sometimes", "somewhat", "somewhere", "such",
"than", "that", "their", "them", "themselves", "then", "thence", "there",
"thereafter", "thereby", "therefore", "therein", "thereupon", "these",
"they", "this", "those", "though", "through", "throughout", "thru", "thus",
"together", "toward", "towards", "under", "until", "upon", "very", "were",
"what", "whatever", "when", "whence", "whenever", "where", "whereafter",
"whereas", "whereby", "wherein", "whereupon", "wherever", "whether",
"which", "while", "whither", "whoever", "whole", "whom", "whose", "will",
"with", "within", "without", "would", "your", "yours", "yourself",
"yourselves"
The beginning of a sentence is determined by the end of the previous word
being followed by either one of the characters
.!?:
or by one of the characters
')"
followed by one of the above characters.
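Rules (i) to (vi) and the sentence test can be sketched in Python like this.
The function names, the abbreviated stop word set and the case-insensitive
list lookup are assumptions on my part; the rules themselves are followed
literally, including the "not at the beginning of a sentence" condition.

```python
import re

# Abbreviated for the example; the full list is given above.
STOP_WORDS = {"about", "with", "your", "no-one"}

# Optional closing quote or bracket, then a sentence-ending character.
SENTENCE_END_RE = re.compile(r"""['")]?[.!?:]""")


def ends_sentence(text_after_word):
    """True if the text right after a word marks the end of a sentence."""
    return SENTENCE_END_RE.match(text_after_word) is not None


def is_stop_word(word, at_sentence_start=False, stop_words=STOP_WORDS):
    """Apply rules (i) to (vi) to a word that has had any 's removed."""
    if len(word) == 1:          # rule (i)
        return True
    if word.isdigit():          # rule (ii)
        return True
    letters = word.replace("-", "")
    if not letters.isalpha():
        return False
    has_hyphen = "-" in word
    in_list = word.lower() in stop_words
    short_or_listed = len(word) < 4 or in_list
    single_case = letters.islower() or letters.isupper()
    capitalized = word[0].isupper() and letters[1:].islower()
    if single_case:
        # rule (v) with a hyphen, rule (iii) without
        return in_list if has_hyphen else short_or_listed
    if capitalized and not at_sentence_start:
        # rule (vi) with a hyphen, rule (iv) without
        return in_list if has_hyphen else short_or_listed
    return False
```

Note that mixed-case words such as "MySQL" never become stop words under
these rules, while short all-uppercase words such as "PHP" do.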
c) Formation of Candidate Phrases
A four word window is moved through the text to extract phrases from. If
the first word in the window is a stop word or is followed by punctuation,
the window is moved on one word. Otherwise, if the second word is not a
stop word, then a two-word phrase is formed from the first and second words
in the window. If the second word is followed by punctuation, the window is
moved on one word. Otherwise, if the third word is not a stop word, then a
three-word phrase is formed from the first, second and third words in the
window. If the third word is followed by punctuation, the window is moved
on one word. Otherwise, if the fourth word is not a stop word, then a
four-word phrase is formed from the first, second, third and fourth words in
the window. Then the window is moved on one word. At the end of the text,
three and two word windows allow the last words in the text to form phrases.
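The window logic above can be sketched in Python as follows.  The tuple
layout and the function name are assumptions; each word is expected to
already carry its stop word and punctuation flags.

```python
def candidate_phrases(words):
    """words: list of (text, is_stop_word, followed_by_punctuation) tuples."""
    phrases = []
    for i, (_, is_stop, punct) in enumerate(words):
        # A phrase cannot start on a stop word or on a word followed by punctuation.
        if is_stop or punct:
            continue
        # Up to four words; shorter near the end of the text.
        window = words[i:i + 4]
        for size in range(2, len(window) + 1):
            _, last_is_stop, last_punct = window[size - 1]
            if not last_is_stop:
                phrases.append(" ".join(text for text, _, _ in window[:size]))
            if last_punct:
                # Punctuation after this word ends the window early.
                break
    return phrases
```

For the text "query optimizer of MySQL, and replication", with "of" and
"and" flagged as stop words, this yields "query optimizer", "query optimizer
of MySQL" and "optimizer of MySQL".  A stop word can sit inside a phrase but
never at either end.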
Each phrase formed is added to a master phrase list. Two phrases are deemed
equal if the stems of their words are identical. Stemming is performed with
Porter's stemming algorithm, with a modification to remove "'s" where
present.
When adding a new occurrence of a phrase, the words in the phrase are
compared. The best form of words is kept, to improve the phrase display. A
"good" word is one whose ownership flag is not set, and which starts with an
uppercase letter and contains only subsequent lowercase letters.
d) Reduction of Candidate Phrases
At the end of the processing of the title and summary from each result, the
master phrase list contains a list of phrases, each with an associated
weight. Most of these phrases will be junk or subsets of other phrases, and
these have to be removed. The following methods are used, in the given
order:
(i) If phrase 1 starts or ends with phrase 2 (when comparing word stems),
and phrase 2's list of contributing results is a subset of phrase 1's list,
then phrase 2 is removed.
(ii) Phrases with only one contributing result are removed.
(iii) Phrases whose weight is less than 14 are removed.
(iv) For each of the results, only the three phrases with the greatest
weights are retained. All occurrences of other phrases are removed. [Note:
this is a weakness as the top four phrases may have identical weights]
(v) Phrases with only one contributing result are removed.
(vi) Phrases whose weight is less than 10 are removed.
(vii) For each of the results, only the phrase with the greatest weight is
retained. All occurrences of other phrases are removed. [Note: again,
this is a weakness as the top two phrases may have identical weights]
(viii) Phrases with only one contributing result are removed.
(ix) Phrases whose weight is less than 6 are removed.
This gradual trimming down of the number of phrases allows those results
whose top phrase eventually gets removed to be listed under their second or
third phrases.
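Here is one way the nine reduction steps might be sketched in Python.  It
rests on several assumptions that the description above does not spell out:
each phrase is a tuple of word stems, each phrase tracks how much weight
every result contributed, removing a result's occurrence of a phrase lowers
that phrase's weight, and ties are broken by sort order (the weakness noted
above).  All names are illustrative.

```python
def is_affix(short, long):
    """True if the stem tuple `long` starts or ends with the shorter `short`."""
    n = len(short)
    return n < len(long) and (long[:n] == short or long[-n:] == short)


def keep_top_per_result(phrases, keep):
    """For each result, drop its occurrences outside its `keep` heaviest phrases."""
    totals = {p: sum(c.values()) for p, c in phrases.items()}
    results = {r for c in phrases.values() for r in c}
    for r in results:
        ranked = sorted((p for p in phrases if r in phrases[p]),
                        key=totals.get, reverse=True)
        for p in ranked[keep:]:
            del phrases[p][r]


def reduce_phrases(phrases):
    """phrases: {stem_tuple: {result_id: weight}}; returns the reduced copy."""
    phrases = {p: dict(c) for p, c in phrases.items()}
    # Step (i): drop a phrase that begins or ends a longer phrase when all of
    # its contributing results also contribute to the longer phrase.
    redundant = {
        short for short in phrases for long in phrases
        if is_affix(short, long) and set(phrases[short]) <= set(phrases[long])
    }
    for p in redundant:
        del phrases[p]
    # Steps (ii)-(ix): three rounds with falling thresholds.
    for threshold, keep in ((14, 3), (10, 1), (6, None)):
        phrases = {p: c for p, c in phrases.items() if len(c) > 1}
        phrases = {p: c for p, c in phrases.items()
                   if sum(c.values()) >= threshold}
        if keep is not None:
            keep_top_per_result(phrases, keep)
    return phrases
```

Results that no longer appear under any surviving phrase would then be
listed separately.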
e) Display of Phrases
Each phrase is listed with its contributing results. Any results whose
contributions have been removed from all the retained phrases are listed
under the "Other Sites" phrase.
If the reduction phase removes every phrase, then a normal list display will
be performed, as if clustering was switched off.
Some points...
a) The clustering only works properly in English, as it uses Porter's
stemming algorithm and an English stop word list.
b) The algorithm used to detect end of sentences could be improved.
c) The phrase reduction algorithm could be improved - currently I have seen
results which contain one of the phrases listed, but which are listed under
"Other Sites".
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Company: The FuzzyGroup, Inc.
What We Do: Quality web development / eVectors IdeaTools VAR
Title: President
Phone: 617 588 0613 / 617 201 4337 cell
Email: sjohnson@fuzzygroup.com
Site: http://www.fuzzygroup.net/
Blog: http://www.fuzzyblog.com/
Yahoo IM: fuzzygroup
AOL IM: fuzzygroup
Emergency: mobile@fuzzygroup.com
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
This page was last updated: 4/6/2003; 3:13:59 AM
Copyright 2003 The FuzzyStuff
Posted In: #email #knowledge_management