Abstract:
Extensible Markup Language (XML) is a new Web specification designed for delivering structured content over the Web, and it plays an increasingly significant role in Web applications and as a data interchange format. XML documents can optionally include rules that restrict the structure of elements and attributes, expressed as a Document Type Definition (DTD) or an XML Schema, which provide a way to validate the structure and content of documents. However, a DTD is not compulsory, and creating one from scratch can be difficult. This research therefore aims to provide a learning mechanism that obtains a high-quality DTD from a set of XML instances. We introduce the star height of the variables into the inference process in order to infer the ?, +, and * metacharacters precisely and to detect regular expression patterns across input sequences. Combined with factoring, reduction, and generalization steps, the learning mechanism infers a concise and meaningful DTD. Experiments are carried out to demonstrate the effectiveness of the mechanism and to compare its efficiency with that of existing approaches.
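To make the role of the ?, +, and * metacharacters concrete, the following minimal sketch assigns an occurrence indicator to each child element from the sequences observed under one parent element. It is an illustration only and not the star-height-based mechanism proposed here: it counts occurrences and ignores element order and grouping. The function name infer_quantifiers and the sample element names are hypothetical.

```python
from collections import Counter

def infer_quantifiers(sequences):
    """Map each child element name to '', '?', '+', or '*'.

    sequences: list of child-name lists, one per observed parent instance.
    Assumption: only occurrence counts are used; order and nesting are ignored.
    """
    # Collect element names in first-seen order.
    names = []
    for seq in sequences:
        for name in seq:
            if name not in names:
                names.append(name)

    result = {}
    for name in names:
        counts = [Counter(seq)[name] for seq in sequences]
        optional = min(counts) == 0   # absent in at least one instance
        repeated = max(counts) > 1    # occurs more than once somewhere
        if optional and repeated:
            result[name] = "*"        # zero or more
        elif repeated:
            result[name] = "+"        # one or more
        elif optional:
            result[name] = "?"        # zero or one
        else:
            result[name] = ""         # exactly one
    return result

# Hypothetical child sequences of three instances of the same parent element.
samples = [
    ["title", "author", "author", "year"],
    ["title", "author"],
    ["title", "author", "note", "note"],
]
quantifiers = infer_quantifiers(samples)
print(quantifiers)
# {'title': '', 'author': '+', 'year': '?', 'note': '*'}
```

Under these assumptions the three samples are consistent with a content model such as (title, author+, year?, note*), although a count-based sketch like this cannot by itself recover the ordering or any repeated groups.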