Sunday, November 04, 2007

Authority records and the Semantic Web

Since we're thinking a lot about authority control and what a post-ILS library system would look like, the Twine system in my previous post is pretty interesting stuff. It makes me think about the relationship of authority control to these efforts to extract meaning from massive quantities of text. I'm going to ramble a bit, and my knowledge about all of this is a mile wide and an inch deep, so, as my PoliSci professor Gil Cuthberson used to say, don't take all this too seriously. I'm just trying to figure out what I think about all this. It's a real puzzle to me.

EXAMPLE: In a 1949 article in SW J of Anthropology, Wm Welmers cites the words Pessi, Kpessi, Kpwessi, Gberese, and Guerze as variants for Kpelle, the name for a Liberian ethnic group. Kpelle is the form most commonly cited in the English literature, and it is the authorized name in LoC authority records. Guerze is the form most commonly cited in the French literature. In the online version of the LoC authority records I could only find these two versions. The other terms were never found in WorldCat as names of an ethnic group (Pessi is the name of a Finnish author, and WC had 9 hits for his works).
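To make the gap concrete, here is a minimal Python sketch of how an authority record ties variant forms to one authorized heading. The record layout and the function name `authorized_form` are my own assumptions for illustration, not the actual LC record format. The lists simply restate the example above: the two forms I found in the LC record, and the variants Welmers cites.

```python
# Hypothetical, simplified stand-in for an authority record; not the real LC/MARC format.
# Holds only the two forms found in the online LC authority record, per the example above.
LC_RECORD = {"authorized": "Kpelle", "variants": ["Guerze"]}

# Variant names cited by Welmers (1949), per the example above.
WELMERS_VARIANTS = ["Pessi", "Kpessi", "Kpwessi", "Gberese", "Guerze"]


def authorized_form(term, record):
    """Return the authorized heading if `term` is a known form, else None."""
    known = {record["authorized"].lower()} | {v.lower() for v in record["variants"]}
    return record["authorized"] if term.lower() in known else None


# Variants from the article that this simplified record would not resolve.
unresolved = [v for v in WELMERS_VARIANTS if authorized_form(v, LC_RECORD) is None]
print(unresolved)
```

In this toy version only Guerze resolves to Kpelle; the other four variants fall through, which is the gap the example is pointing at.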

My point is not to disparage the work done by the excellent catalogers at LoC nor say that we can or should abandon this work. The authority work already in existence should serve a very useful purpose in building a semantic knowledgebase. That work could serve as a scaffolding/anchoring system. A word that exists in a name authority record is, or at least can be thought of as, something that we name. Or perhaps it could work as a system to aid in ranking: if words are linked in LC authority records, they get an extra boost in relevance.
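The ranking idea might look something like the following Python sketch. Everything here is an assumption made up for illustration: the `LINKED` table, the `BOOST` value, and the function `boosted_score` are not how any real search system or LC service works.

```python
# Hypothetical link table: pairs of terms assumed to share an authority record.
LINKED = {frozenset({"kpelle", "guerze"})}

BOOST = 1.5  # arbitrary illustrative multiplier, not a tuned value


def boosted_score(query, doc_term, base_score):
    """Multiply base_score by BOOST when the two terms are linked in the authority data."""
    pair = frozenset({query.lower(), doc_term.lower()})
    return base_score * BOOST if pair in LINKED else base_score


print(boosted_score("Kpelle", "Guerze", 2.0))  # linked pair, so the score is boosted
print(boosted_score("Kpelle", "Pessi", 2.0))   # no link, so the score is unchanged
```

The text-mining side would supply the base score; the authority data only nudges it upward when human catalogers have already connected the two words.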

But we have to recognize the limits of human effort and appreciate the bounty of indexing massive quantities of text. My ability to come up with the example I did is based on the fact that I have ridiculously deep/focused knowledge about the Kpelle and am acquainted with the concept of LC authority records.

OK, so the stuff we have now is useful. It represents millions of hours of human intellectual labor. Do we still need to continue the work? I would argue that yes, we do. The effort needs to be cooperative, and we won't be able to do equivalent labor on everything in the world.


still thinkin'....

1 comment:

Maureen said...

This is interesting and thought-provoking. I haven't had a chance to read the blog for a while, so I am catchin' up! IMHO there are times when controlled vocabulary comes in mighty handy (searching the medical and health care literature comes to mind); at other times I want to use natural language and have the power to get at every word in the record. I would really like to see databases like WorldCat become more Google-like without losing the wonderful LCSH and MeSH and whatever other subject headings it has.

And amen to your describing your knowledge as pretty shallow on this subject: mine too! But the answers to these qs will shape the profession. Thanks.