Abstract:
In universities, institutes, and other organizations, managing a large number of documents and extracting useful information from them is a persistent challenge. There is a need to automatically extract the names of participants from different kinds of contracts, agreements, and resolutions so that these documents can be stored and retrieved systematically. Contracts, agreements, and resolutions are semi-structured documents and are usually written in a formal style. Although named entity recognition has been studied for nearly twenty years, this domain has received limited attention.
In our research, we exploited and combined the features of semi-structured texts and formal writing style to recognize and classify named entities. We proposed a semi-supervised learning method that combines conditional random fields (CRF), dictionaries, and heuristic rules. We applied our method to a MoU/MoA corpus and obtained acceptable results, with an F-measure of up to 88.01%. The method is also flexible: it can be applied not only to the MoU/MoA corpus but also to other semi-structured texts by providing the corresponding dictionaries. We hope that, building on the results of this study, prospective AIT students will continue and complete the system so that it can be used by the External Relations and Communications Office, contributing significantly to the development of AIT.
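
As a rough illustration of how dictionary lookups and heuristic rules might be turned into per-token features for a CRF tagger, consider the Python sketch below. It is not the feature set used in this study: the dictionaries `ORG_KEYWORDS` and `TITLE_WORDS`, the cue words, and the function names are all hypothetical and chosen only for the example.

```python
# Hypothetical dictionaries; a real system would load much larger,
# domain-specific lists (these entries are assumptions for illustration).
ORG_KEYWORDS = {"university", "institute", "office", "ministry"}
TITLE_WORDS = {"dr.", "prof.", "mr.", "ms."}
# Hypothetical heuristic cue: in formal agreements, party names often
# follow words such as "between" or "and".
PARTY_CUES = {"between", "and"}


def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    tok = tokens[i]
    prev_tok = tokens[i - 1].lower() if i > 0 else "<BOS>"
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),      # orthographic feature
        "is_all_caps": tok.isupper(),             # e.g. acronyms
        "in_org_dict": tok.lower() in ORG_KEYWORDS,   # dictionary feature
        "prev_is_title": prev_tok in TITLE_WORDS,     # dictionary feature
        "prev_is_party_cue": prev_tok in PARTY_CUES,  # heuristic rule
    }


def sentence_features(tokens):
    """Return one feature dictionary per token in the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

A list of feature dictionaries like this, one list per sentence, is the kind of input that common CRF toolkits accept for training. Adapting the sketch to another type of semi-structured text would mainly mean swapping in different dictionaries and cue words.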