Automatic construction of labeled clusters of named entities for IR
Details
In this study we harvest labeled clusters of semantically similar named entities that can serve as a first step for web document clustering. We first collect ~44,000 named entities from a thesaurus constructed by Dekang Lin using a word similarity measure based on distributional patterns. Using these similarity metrics and the CLUTO clustering software, we create 2,000 clusters of semantically similar named entities. We then collect ~305,500 label-instance pairs from the 2007 English Wikipedia dump and implement a labeling algorithm presented by Benjamin Van Durme and M. Paşca (2008) to assign a label to each cluster. This automatic labeling task assigns a label that describes the majority of the named entities in 924 of the clusters, which is 46.2% of all clusters. Finally, we evaluate both the clustering and labeling tasks on 86 randomly selected clusters, based on the subjective judgment of two native English-speaking evaluators. According to these evaluators, the clustering task has a purity score of 0.7, and 55% of the labels are acceptable with varying degrees of accuracy.
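To make the two evaluation ideas in the abstract more concrete, here is a minimal Python sketch of majority-based cluster labeling and of the purity score. It is not the Van Durme and Paşca algorithm and not the author's implementation. The function names, the `min_share` threshold, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def majority_label(cluster, labels_of, min_share=0.5):
    """Return the label shared by more than min_share of the cluster, else None.

    Hypothetical helper: a plain majority vote, not the published algorithm.
    """
    counts = Counter()
    for entity in cluster:
        # Count each label at most once per entity.
        counts.update(set(labels_of.get(entity, ())))
    if not counts:
        return None
    # Ties are broken alphabetically so the result is deterministic.
    label, hits = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return label if hits / len(cluster) > min_share else None


def purity(clusters, gold_class):
    """Share of entities that belong to the dominant gold class of their cluster."""
    non_empty = [c for c in clusters if c]
    total = sum(len(c) for c in non_empty)
    if total == 0:
        return 0.0
    dominant = sum(
        max(Counter(gold_class[e] for e in c).values()) for c in non_empty
    )
    return dominant / total


# Toy data (invented for illustration, not taken from the thesis).
clusters = [["Paris", "Berlin", "Rome", "Nile"], ["Amazon", "Danube"]]
labels_of = {
    "Paris": ["city", "capital"],
    "Berlin": ["city", "capital"],
    "Rome": ["city"],
    "Nile": ["river"],
    "Amazon": ["river", "company"],
    "Danube": ["river"],
}
gold_class = {
    "Paris": "city", "Berlin": "city", "Rome": "city",
    "Nile": "river", "Amazon": "river", "Danube": "river",
}

cluster_labels = [majority_label(c, labels_of) for c in clusters]
score = purity(clusters, gold_class)
```

In this toy setup the first cluster contains one misplaced entity, so purity falls below 1 even though both clusters still receive a majority label. A real evaluation, as in the study, would rely on human judgments rather than a fixed gold mapping.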
About the author
I studied computer science at Mekelle University, Ethiopia. Immediately after graduation, I got a teaching assistant job in the same department where I studied. I worked at the university for two years and was awarded an Erasmus Mundus scholarship to continue my master's studies in Language and Communication Technology.
Further information
- General information
- GTIN 09783844334722
- Language English
- Dimensions H 220 mm x W 150 mm x D 4 mm
- Year 2011
- EAN 9783844334722
- Format Paperback
- ISBN 3844334726
- Publication date 23 May 2011
- Title Automatic construction of labeled clusters of named entities for IR
- Author Henock Tilahun Teffera
- Subtitle Thesis for European Master's in Language and Communication Technology
- Weight 113 g
- Publisher LAP LAMBERT Academic Publishing
- Number of pages 64
- Genre Computer science