SEMI-AUTOMATED CLASS NUMBER PREDICTION OF BIBLIOGRAPHICAL RESOURCES: A FRAMEWORK DEPLOYING ANNIF
Synopsis
This study investigates a semi-automated, machine-learning-based indexing system that helps libraries process large document collections efficiently. Using supervised learning within Annif, an open-source Python framework for automated subject indexing, we trained models on manually classified MARC bibliographic records organized according to the Dewey Decimal Classification (DDC). We collected and processed records containing titles, summaries, DDC numbers and subject descriptors, then divided them into training and test datasets. We evaluated four algorithms (TF-IDF, Omikuji, FastText and NN Ensemble) using standard retrieval metrics (F1@5 and NDCG) and found that Omikuji and NN Ensemble significantly outperformed the others in indexing accuracy. The complete framework is open source and demonstrates the viability of machine learning for library classification tasks, offering an efficient alternative to manual indexing while maintaining accuracy. These results suggest promising applications for AI in knowledge organization systems, with potential for expansion to other classification schemes and larger datasets to further enhance performance.
Keywords: Supervised Machine Learning, Semi-Automated Classification, Automated Subject Indexing, DDC, Annif, Ensemble approach.
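For readers unfamiliar with the two evaluation metrics named in the synopsis, the following is a minimal sketch of how F1@5 and NDCG can be computed for a single record with binary relevance. It is an illustration only and is independent of Annif's own evaluation code; the function names `f1_at_k` and `ndcg`, and the DDC numbers in the example, are hypothetical and not taken from the study's data.

```python
import math


def f1_at_k(ranked, gold, k=5):
    """F1 score over the top-k suggested class numbers (binary relevance)."""
    top = ranked[:k]
    hits = sum(1 for label in top if label in gold)
    if hits == 0:
        return 0.0
    precision = hits / len(top)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def ndcg(ranked, gold, k=None):
    """Normalized discounted cumulative gain with binary gains."""
    top = ranked if k is None else ranked[:k]
    # A correct suggestion at 0-based position i contributes 1 / log2(i + 2).
    dcg = sum(1.0 / math.log2(i + 2)
              for i, label in enumerate(top) if label in gold)
    # Ideal ranking: all gold labels placed first, limited by the list length.
    ideal_hits = min(len(gold), len(top))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: five ranked suggestions versus three assigned numbers.
suggested = ["025.04", "006.31", "020", "004", "511.3"]
assigned = {"006.31", "025.04", "025.3"}

f1 = f1_at_k(suggested, assigned, k=5)
gain = ndcg(suggested, assigned)
print(f"F1@5 = {f1:.3f}, NDCG = {gain:.3f}")
```

In an evaluation over a test set, per-record scores such as these would typically be averaged across all records; the exact averaging used in the study is not stated in the synopsis.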