The FuzzyBlog!

Marketing 101. Consulting 101. PHP Consulting. Random geeky stuff. I Blog Therefore I Am.

Email Based Knowledge Management

Hi Zak,

[Disclaimer: I wrote like 90% of this on the plane back and then lost it in

a drafts folder. Sorry about the delay. Hope it's still useful. Let me

know if I can help at all. ]

-- Scott

Hi there,

It was nice to meet you at PHP Con, in the bar with Jeremy on the last

night. You had mentioned building a classification system and I tossed

out a few throwaway ideas. This email is the promised clarification of

said ideas.

So here's the problem that you are trying to solve (as I understand it

from the few comments you made):

a) You have a distributed group of largely technical folk who have

information flowing between them

b) Information is in different forms but is largely text data

c) Search in and of itself probably isn't sufficient to solve the

information access and organization problem -- classification is needed.

You made an extremely accurate remark, just prior to commenting on

classification, on the difficulties of both mailing lists and IM -- i.e.

who to contact, watching for the response, the fear of being thought

foolish (I think this is my addition, not yours; either way). You then

made a comment about engineers contributing to a repository and

"categorizing / classifying it on entry".

My experience is that this tends to work poorly if at all for these reasons:

a) Classification is just plain hard. It's easy for the person who

creates the schema or taxonomy but, of course, much, much harder for the

person trying to use it.

(Anyone who's ever tried to understand the organization of a deep class

hierarchy like Smalltalk's can sympathize with this.)

b) If you only go with 1 categorization entry it's the same as "what

folder do I put this email in". And you correctly pointed out that any

classification system would have to support multiple categories per

entry. This, of course, raises the issue of over-classification.

c) A practical matter is that browsers are very, very poor at

efficiently displaying trees of information that are selectable. I've

done this with Java for classification tools (kill me now) and DHTML is

better but still not good.

d) Classifying things well takes time. And if contributing to a

repository takes time, it doesn't get done. Period.

Sidebar: A way to encourage repository contributions is incentivizing

the process by tying it to a person's career path. Some of the big

consulting firms have had luck with this but it's very, very hard since

that then raises the issue of rating contributions, feedback,

correctness, etc.

So with this said, is classification dead? Is KM viable in a small,

distributed, busy organization? Not at all. Early KM systems (Dataware

II KMS, etc.) tended to be large, centralized, bulky systems that

operated in a disjoint fashion from the organization's day to day

business activities. KM was a "special activity". That's just plain

wrong. It's distressingly wrong (disclaimer: I was the product manager

and one of the architects of the Dataware II KMS).

Here's how I would solve this problem today:

a) Adopt email as a first class input. I'd probably be tempted to

abandon the idea of a web form as an input source entirely. Very little

knowledge today doesn't pass through email at one point or another.

b) Make contributions to the repository as simple as forwarding, cc'ing

or bcc'ing an existing message / document attachment / url to a known

address. Say "km@mysql.com" or "library@mysql.com".

c) Route those messages into a repository extracting from them metadata

as follows (this is a rough draft and would need more thought):

Subject ------- treat this as title

Contributor --- the sender

URLs ---------- I'd pull these into a separate table

Attachments --- I'd pull out type, size, filename into separate fields /

tables

OtherFolks ---- Any other people cc'd / authors on the message

Filenames ----- Since MySQL is a software company, being able to extract

from a message that there is a reference to queryoptimizer.c is probably

useful.
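
As a rough illustration of the extraction step above, here is a minimal sketch using Python's standard email package. The function name, the returned field names, and both regular expressions are assumptions made for illustration; in particular, the filename pattern only covers a handful of extensions and would need tuning for a real code base.

```python
import re
from email import message_from_string
from email.utils import getaddresses

URL_RE = re.compile(r"https?://[^\s<>\"')]+")
# Hypothetical pattern for source-like filenames such as queryoptimizer.c
FILE_RE = re.compile(r"\b[\w-]+\.(?:c|h|cc|cpp|php|sql)\b")


def extract_metadata(raw):
    """Pull the fields listed above out of one raw message string."""
    msg = message_from_string(raw)
    body_parts, attachments = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload(decode=True) or b""
        filename = part.get_filename()
        if filename:
            # Attachment: record type, size and filename only
            attachments.append({"filename": filename,
                                "type": part.get_content_type(),
                                "size": len(payload)})
        elif part.get_content_type() == "text/plain":
            charset = part.get_content_charset() or "utf-8"
            body_parts.append(payload.decode(charset, errors="replace"))
    body = "\n".join(body_parts)
    sender = getaddresses([msg.get("From", "")])
    return {
        "title": msg.get("Subject", ""),
        "contributor": sender[0][1] if sender else "",
        "other_folks": [addr for _, addr in getaddresses(msg.get_all("Cc", []))],
        "urls": URL_RE.findall(body),
        "filenames": sorted(set(FILE_RE.findall(body))),
        "attachments": attachments,
    }
```

Each key in the returned dictionary corresponds to one row of the list above, so URLs and attachments can be written to their own tables afterwards.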

(Issues:

When the contributor of the article isn't the author, who gets credit?

Depending on the complexity of signatures on the bottom of messages you

may need a "message zoning" algorithm that discards signatures.

I'd also suggest that a supporting data table that maps plain text

customer names to customer URLs be part of the metadata system {or a

separate query expansion / query processor }. This would recognize that

anything with a URL or email address that referenced "abcsystems.com" is

equivalent to "ABC Systems". This is very, very useful since we tend to

think and speak in English, not DNS.

If a customer has particular issues like "Jeremy = Large database |

Replication" then you could use this as a way to embed additional

classification tags.

)
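
The customer lookup mentioned in the issues above could be sketched as follows. The two tables, their contents, and the function names are hypothetical; in practice they would be database tables, and the domain pattern here does not handle subdomains other than "www".

```python
import re

# Hypothetical lookup tables; real ones would live in the repository database.
CUSTOMERS = {"abcsystems.com": "ABC Systems"}
CUSTOMER_TAGS = {"ABC Systems": ["Large database", "Replication"]}

# Captures the domain after an "@" (email) or "://" (URL), skipping a leading "www."
DOMAIN_RE = re.compile(r"(?:@|://)(?:www\.)?([a-z0-9-]+(?:\.[a-z0-9-]+)+)", re.I)


def customers_in(text):
    """Return the plain-text customer names referenced by any URL or address."""
    found = []
    for domain in DOMAIN_RE.findall(text):
        name = CUSTOMERS.get(domain.lower())
        if name and name not in found:
            found.append(name)
    return found


def tags_for(text):
    """Collect the extra classification tags attached to each customer found."""
    tags = []
    for name in customers_in(text):
        tags.extend(t for t in CUSTOMER_TAGS.get(name, []) if t not in tags)
    return tags
```

A message that mentions bob@abcsystems.com would then be filed under "ABC Systems" and pick up that customer's tags without the contributor typing anything.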

d) Now, classification (you are probably thinking "Finally!"). There

are a couple of questions and issues to understand. The first issue is

simple: Key Words Suck Rocks. There are very few topics where a "key

word" really works well. This is particularly true in mature technical

fields where a vocabulary has grown up. Take my presentation on PHP

Login Security at this conference. Just the keyword "Security" is

pretty useless these days. It's much, much more effective to generate key

phrases, using something like Porter's stemming algorithm to match their

variant forms. To see this, search for "MySQL" (no quotes) on this meta

search engine, which uses Porter's stemming algorithm:

http://www.queryserver.com/web.htm (this was a product of mine once upon a time).

If you want to automatically assign classifications then I'd recommend

implementing Porter's Stemming and mapping the keywords to the taxonomy

entries. Rather than doing it all at once I'd build a training interface

which lets it be done with human assistance for the first X time periods.

Trying to get it right up front usually doesn't work.
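
To make the training idea concrete, here is a minimal sketch. It is an assumption-laden illustration, not the real thing: crude_stem is a deliberately simple stand-in for Porter's algorithm (a real implementation should be swapped in), and the class and method names are invented for this example.

```python
from collections import Counter, defaultdict


def crude_stem(word):
    """Stand-in for a real Porter stemmer: strips a few common suffixes."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


class TaxonomyTrainer:
    """Learns stem -> taxonomy entry mappings from human-confirmed examples."""

    def __init__(self):
        self.votes = defaultdict(Counter)  # stem -> Counter of categories

    def confirm(self, text, category):
        """Record a human decision: this text belongs under this category."""
        for word in text.split():
            self.votes[crude_stem(word)][category] += 1

    def suggest(self, text):
        """Rank categories by how often the text's stems were confirmed for them."""
        tally = Counter()
        for word in text.split():
            tally.update(self.votes.get(crude_stem(word), {}))
        return [category for category, _ in tally.most_common()]
```

During the training period a person confirms or corrects each suggestion; once the suggestions are usually right, the confirmation step can be dropped.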

Porter's Stemming is a fairly well understood bit of code with

implementations in most languages. Results vary based on language and it

works best in English. The algorithm implementation details are below.

Well, I'm probably boring you by this point. Let me know if

you want more information.

Best

Scott

-- Porter's stemming implementation (I'm not sure what the legal issues are

with my giving you this since there have been companies sold, people laid

off, etc. If you reproduce this then please don't reference

www.queryserver.com. It should be fine but always best to be cautious).

The clustering is based on both the title and summary of each result, with

phrases from the title having a higher weight.

 

Here is a full description of the algorithm, for your comments....

a) Source of Candidate Phrases

Phrases are extracted from the result's title and summary, with each

occurrence of the phrase from the former contributing a weight of 2+2*N, and

each occurrence from the latter contributing a weight of 2+N, where N is the

number of search services that returned the result.

The weight of a phrase is the sum of the weights of the occurrences of the

phrase in the results.
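
The weighting above works out as follows. This is only a worked example of the arithmetic; the function names and the (in_title, n_services) pair representation are assumptions.

```python
def occurrence_weight(in_title, n_services):
    """Weight of one occurrence: 2+2*N from a title, 2+N from a summary."""
    return 2 + 2 * n_services if in_title else 2 + n_services


def phrase_weight(occurrences):
    """Sum the weights of all occurrences, given (in_title, n_services) pairs."""
    return sum(occurrence_weight(in_title, n) for in_title, n in occurrences)
```

So a phrase seen once in the title and once in the summary of a result returned by three services, plus once in the summary of a result returned by one service, weighs 8 + 5 + 3 = 16.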

b) Extraction of Words

The text to extract phrases from is split into words. A word is a set of

adjacent alphanumeric characters, which can also contain a hyphen (but not

start or end with a hyphen), and which can also end with "'s" (but not end

with "-'s"). Each word is assigned a type (see below), an ownership flag

(whether it ends with "'s"), a stop word flag (see below), and a punctuation

flag (whether punctuation exists between the word and the next).

A word's type is a set of flags which indicate the type of characters

present in the word. The flags, which are ORed together, are as follows: 1

= first char is lowercase, 2 = first char is uppercase, 4 = first char is

digit; 8 = subsequent char is lowercase, 16 = subsequent char is uppercase,

32 = subsequent char is digit; 64 = word contains hyphen. The word type is

used in the detection of stop words, and in determining the best form of a

word for display purposes.
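
A sketch of the word extraction and type flags described above. The regular expression, the dictionary keys, and the choice to compute the type after stripping "'s" are assumptions; the stop word flag is left out here because it has its own rules.

```python
import re

# Alphanumeric runs, optional inner hyphens, optional trailing "'s"
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*(?:'s)?")

FIRST_LOWER, FIRST_UPPER, FIRST_DIGIT = 1, 2, 4
NEXT_LOWER, NEXT_UPPER, NEXT_DIGIT, HAS_HYPHEN = 8, 16, 32, 64


def word_type(word):
    """OR together the flags describing the characters present in the word."""
    first = word[0]
    flags = FIRST_LOWER if first.islower() else FIRST_UPPER if first.isupper() else FIRST_DIGIT
    for ch in word[1:]:
        if ch.islower():
            flags |= NEXT_LOWER
        elif ch.isupper():
            flags |= NEXT_UPPER
        elif ch.isdigit():
            flags |= NEXT_DIGIT
        else:
            flags |= HAS_HYPHEN  # the pattern allows nothing else inside a word
    return flags


def extract_words(text):
    """Split text into words with type, ownership and punctuation flags."""
    matches = list(WORD_RE.finditer(text))
    words = []
    for i, m in enumerate(matches):
        raw = m.group()
        owned = raw.endswith("'s")
        base = raw[:-2] if owned else raw
        gap_end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        gap = text[m.end():gap_end]
        words.append({"text": base, "type": word_type(base), "owned": owned,
                      "punct": any(not c.isspace() for c in gap)})
    return words
```

With this pattern "couldn't" splits into "couldn" and "t", which may be why fragments like "couldn" appear in the stop word list.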

The stop word flag is set if the word has one of the following

characteristics:

(i) The word length is 1.

(ii) The word consists entirely of digits.

(iii) The word consists entirely of lowercase or uppercase letters (but not

a mixture), and either the word length is less than 4 or the word is in the

stop word list.

(iv) The word starts with an uppercase letter and contains only subsequent

lowercase letters, and the word is not at the beginning of a sentence, and

either the word length is less than 4 or the word is in the stop word list.

(v) The word consists entirely of lowercase or uppercase letters (but not a

mixture) and contains a hyphen, and the word is in the stop word list.

(vi) The word starts with an uppercase letter and contains only subsequent

lowercase letters and at least one hyphen, and the word is not at the

beginning of a sentence, and the word is in the stop word list.

The stop word list currently contains the following words:

"about", "above", "across", "after", "afterwards", "again", "against",

"almost", "alone", "along", "already", "also", "although", "always",

"among", "amongst", "amount", "another", "anybody", "anyhow", "anyone",

"anything", "anyway", "anywhere", "around", "became", "because", "become",

"becomes", "becoming", "been", "before", "beforehand", "behind", "below",

"beside", "besides", "between", "beyond", "both", "cannot", "could",

"couldn", "done", "down", "during", "each", "either", "else", "elsewhere",

"empty", "enough", "even", "ever", "every", "everybody", "everyone",

"everything", "everywhere", "except", "first", "former", "formerly", "from",

"full", "further", "hasn", "have", "hence", "here", "hereafter", "hereby",

"herein", "hereupon", "hers", "herself", "himself", "however", "inc",

"indeed", "into", "itself", "last", "latter", "latterly", "least", "less",

"many", "meanwhile", "more", "moreover", "most", "mostly", "much", "must",

"myself", "namely", "neither", "never", "nevertheless", "next", "nobody",

"none", "no-one", "nothing", "nowhere", "often", "only", "onto", "other",

"others", "otherwise", "ours", "ourselves", "perhaps", "please", "rather",

"same", "seem", "seemed", "seeming", "seems", "serious", "several",

"should", "shouldn", "since", "sincere", "some", "somehow", "someone",

"something", "sometime", "sometimes", "somewhat", "somewhere", "such",

"than", "that", "their", "them", "themselves", "then", "thence", "there",

"thereafter", "thereby", "therefore", "therein", "thereupon", "these",

"they", "this", "those", "though", "through", "throughout", "thru", "thus",

"together", "toward", "towards", "under", "until", "upon", "very", "were",

"what", "whatever", "when", "whence", "whenever", "where", "whereafter",

"whereas", "whereby", "wherein", "whereupon", "wherever", "whether",

"which", "while", "whither", "whoever", "whole", "whom", "whose", "will",

"with", "within", "without", "would", "your", "yours", "yourself",

"yourselves"

The beginning of a sentence is detected when the end of the preceding word

is followed by either one of the characters

.!?:

or by one of the characters

')"

followed by one of the above characters.
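
The six stop word rules and the sentence test can be sketched together. Several things here are assumptions: the stop list is abbreviated to four entries, lookups are case-insensitive, the first word of the text counts as a sentence start, and the function names are invented.

```python
import re

STOP_WORDS = {"about", "with", "your", "no-one"}  # abbreviated; see the full list above
SENTENCE_END = re.compile(r"""['")]?[.!?:]""")


def starts_sentence(gap_before):
    """gap_before is the text between the previous word and this one (None if first)."""
    return gap_before is None or SENTENCE_END.match(gap_before) is not None


def is_stop_word(word, at_sentence_start):
    """Apply rules (i) to (vi) to a word whose "'s" has already been removed."""
    if len(word) == 1 or word.isdigit():               # rules (i) and (ii)
        return True
    letters = word.replace("-", "")
    if not letters.isalpha():
        return False
    listed = word.lower() in STOP_WORDS
    hyphen = "-" in word
    one_case = letters.islower() or letters.isupper()
    capitalised = letters[0].isupper() and letters[1:].islower()
    if one_case or (capitalised and not at_sentence_start):
        # rules (v)/(vi) with a hyphen, rules (iii)/(iv) without
        return listed if hyphen else (len(word) < 4 or listed)
    return False
```

Read literally, rules (iv) and (vi) never flag a capitalised word at the start of a sentence, so "About" is a stop word mid-sentence but not when it opens one; the sketch follows the rules as written.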

c) Formation of Candidate Phrases

A four-word window is moved through the text being processed. If

the first word in the window is a stop word or is followed by punctuation,

the window is moved on one word. Otherwise, if the second word is not a

stop word, then a two-word phrase is formed from the first and second words

in the window. If the second word is followed by punctuation, the window is

moved on one word. Otherwise, if the third word is not a stop word, then a

three-word phrase is formed from the first, second and third words in the

window. If the third word is followed by punctuation, the window is moved

on one word. Otherwise, if the fourth word is not a stop word, then a

four-word phrase is formed from the first, second, third and fourth words in

the window. Then the window is moved on one word. At the end of the text,

three- and two-word windows allow the last words in the text to form phrases.
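
The window logic above can be sketched as a single loop. The input format, a list of (text, is_stop, punct_after) tuples, and the function name are assumptions; the shrinking window at the end of the text is handled by capping the range.

```python
def candidate_phrases(words):
    """words: list of (text, is_stop, punct_after) tuples, in text order."""
    phrases = []
    for i, (_, stop, punct) in enumerate(words):
        if stop or punct:
            continue  # first word unusable: move the window on one word
        # Try to extend the phrase to 2, then 3, then 4 words
        for j in range(i + 1, min(i + 4, len(words))):
            _, last_stop, last_punct = words[j]
            if not last_stop:
                phrases.append(" ".join(w[0] for w in words[i:j + 1]))
            if last_punct:
                break  # punctuation ends the window early
    return phrases
```

Note that only the first and last words of a phrase must be non-stop words, so a stop word can sit in the middle, as in "query cache for MySQL".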

Each phrase formed is added to a master phrase list. Two phrases are deemed

equal if the stems of their words are identical. Stemming is performed with

Porter's stemming algorithm, with a modification to remove "'s" where

present.

When adding a new occurrence of a phrase, the words in the phrase are

compared. The best form of words is kept, to improve the phrase display. A

"good" word is one whose ownership flag is not set, and which starts with an

uppercase letter and contains only subsequent lowercase letters.
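
A sketch of the master phrase list, keyed by word stems, with the "best form" rule applied per word. The data layout and names are assumptions, and crude_stem is only a stand-in for Porter's algorithm.

```python
def crude_stem(word):
    """Stand-in for Porter's stemmer: lowercase and drop a plural 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def is_good(word, owned):
    """A 'good' display form: no ownership flag, capitalised, rest lowercase."""
    return not owned and word[:1].isupper() and word[1:].islower()


def add_occurrence(master, words, weight, stem=crude_stem):
    """words: list of (text, owned) pairs with any "'s" already stripped."""
    key = tuple(stem(text) for text, _ in words)
    entry = master.setdefault(key, {"words": list(words), "weight": 0})
    entry["weight"] += weight
    # Keep the better display form of each word, position by position
    entry["words"] = [new if is_good(*new) and not is_good(*old) else old
                      for old, new in zip(entry["words"], words)]
    return entry
```

Two occurrences such as "query caches" and "Query Cache's" land under the same key, and the stored display form picks up "Query" from the second one.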

d) Reduction of Candidate Phrases

At the end of the processing of the title and summary from each result, the

master phrase list contains a list of phrases, each with an associated

weight. Most of these phrases will be junk or subsets of other phrases, and

these have to be removed. The following methods are used, in the given

order:

(i) If phrase 1 starts or ends with phrase 2 (when comparing word stems),

and phrase 2's list of contributing results is a subset of phrase 1's list,

then phrase 2 is removed.

(ii) Phrases with only one contributing result are removed.

(iii) Phrases whose weight is less than 14 are removed.

(iv) For each of the results, only the three phrases with the greatest

weights are retained. All occurrences of other phrases are removed. [Note:

this is a weakness as the top four phrases may have identical weights]

(v) Phrases with only one contributing result are removed.

(vi) Phrases whose weight is less than 10 are removed.

(vii) For each of the results, only the phrase with the greatest weight is

retained. All occurrences of other phrases are removed. [Note: again,

this is a weakness as the top two phrases may have identical weights]

(viii) Phrases with only one contributing result are removed.

(ix) Phrases whose weight is less than 6 are removed.

This gradual trimming down of the number of phrases allows those results

whose top phrase eventually gets removed to be listed under their second or

third phrases.
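
The nine reduction steps can be sketched as three small functions. This assumes each phrase is a tuple of word stems mapped to a dictionary of {result id: weight contributed}, and that a phrase's weight is recomputed from whatever contributions remain; neither detail is stated above.

```python
def weight(contribs):
    return sum(contribs.values())


def drop_subphrases(phrases):
    """Step (i): remove a phrase that starts or ends a longer one with a superset of results."""
    def contained(short, long):
        n = len(short)
        return n < len(long) and (long[:n] == short or long[-n:] == short)
    return {p: c for p, c in phrases.items()
            if not any(contained(p, q) and set(c) <= set(phrases[q]) for q in phrases)}


def drop_weak(phrases, min_weight):
    """Remove phrases with at most one contributing result or too little weight."""
    return {p: c for p, c in phrases.items() if len(c) > 1 and weight(c) >= min_weight}


def keep_top(phrases, n):
    """For each result, keep its contribution only in its n heaviest phrases."""
    weights = {p: weight(c) for p, c in phrases.items()}
    results = {r for c in phrases.values() for r in c}
    kept = {p: {} for p in phrases}
    for r in results:
        ranked = sorted((p for p in phrases if r in phrases[p]),
                        key=weights.get, reverse=True)
        for p in ranked[:n]:
            kept[p][r] = phrases[p][r]
    return kept


def reduce_phrases(phrases):
    phrases = drop_weak(drop_subphrases(phrases), 14)         # steps (i)-(iii)
    for top_n, min_weight in ((3, 10), (1, 6)):               # steps (iv)-(ix)
        phrases = drop_weak(keep_top(phrases, top_n), min_weight)
    return phrases
```

Ties in keep_top are broken by sort order, which is the weakness noted in steps (iv) and (vii).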

e) Display of Phrases

Each phrase is listed with its contributing results. Any results whose

contributions have been removed from all the retained phrases are listed

under the "Other Sites" phrase.

If the reduction phase removes every phrase, then a normal list display will

be performed, as if clustering was switched off.
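
The display step might look like this. It assumes the same phrase structure as the reduction step (a tuple of words mapped to {result id: weight}), and returning None to signal the fallback to a normal list is a choice made for this example.

```python
def group_results(phrases, all_results):
    """Group result ids under their phrases, with leftovers under "Other Sites"."""
    if not phrases:
        return None  # caller falls back to the normal list display
    groups = {" ".join(p): sorted(c) for p, c in phrases.items()}
    placed = {r for c in phrases.values() for r in c}
    others = [r for r in all_results if r not in placed]
    if others:
        groups["Other Sites"] = others
    return groups
```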

 

 

Some points...

a) The clustering only works properly in English, as it uses Porter's

stemming algorithm and an English stop word list.

b) The algorithm used to detect end of sentences could be improved.

c) The phrase reduction algorithm could be improved - currently I have seen

results which contain one of the phrases listed, but which are listed under

"Other Sites".

 

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Company: The FuzzyGroup, Inc.

What We Do: Quality web development / eVectors IdeaTools VAR

Title: President

Phone: 617 588 0613 / 617 201 4337 cell

Email: sjohnson@fuzzygroup.com

Site: http://www.fuzzygroup.net/

Blog: http://www.fuzzyblog.com/

Yahoo IM: fuzzygroup

AOL IM: fuzzygroup

Emergency: mobile@fuzzygroup.com

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

 

 

 

This page was last updated: 4/6/2003; 3:13:59 AM

Copyright 2003 The FuzzyStuff

Theme Design by Bryan Bell
