

OntoM - Ontology Alignment / Schema Matching

Background

Schema matching is the task of providing correspondences between concepts describing the meaning of data in various heterogeneous, distributed data sources (e.g., attributes in database schemata, tags in XML DTDs, fields in HTML forms). It is recognized as one of the basic operations required by the process of data and schema integration [1–3], and thus has a great impact on its outcome. Roughly speaking, schema matchers receive as input two or more schemata, compute a measure of similarity between attributes, and suggest as output a possible set of correspondences between attributes. The main objective of schema matchers is to provide schema matchings that are effective from the user's point of view yet computationally efficient (or at least not disastrously expensive).

The outcome of the matching process can serve in tasks of targeted content delivery, view integration, database integration, query rewriting over heterogeneous sources, duplicate data elimination, and automatic streamlining of workflow activities that involve heterogeneous data sources. As such, schema matching has an impact on numerous modern applications from various application areas. It impacts business, where company data sources continuously realign due to changing markets. It also impacts the way business and other information consumers seek information over the Web. It impacts life sciences, where scientific workflows cross system boundaries more often than not. Finally, it impacts the way communities of knowledge are created and evolve.

Related Research

Schema matching research has been going on for more than 25 years (see surveys [1,4–6] and online lists, e.g., OntologyMatching, Ziegler, DigiCULT, SWgr), first as part of schema integration and then as a standalone research area. Over the years, a significant body of work has been devoted to the identification of schema matchers, that is, heuristics for schema matching. Examples include COMA [7], Cupid [8], OntoBuilder [9], Autoplex [10], Similarity Flooding [11], Clio [12], Glue [13], and others [14–16]. Such research has evolved in different research communities, including databases, information retrieval, information sciences, data semantics, and the semantic Web.

Ontobuilder Services for Schema Matching

Ontobuilder lets you choose the algorithm used to perform the matching process:

In order to match two ontologies select: Ontology -> Ontology Merge Wizard...

Ontology Merge Wizard

Description of built-in first-line matchers:

Term: Term matching compares labels and names to identify syntactically similar terms. To achieve better performance, terms are preprocessed using several techniques originating in IR research. Term matching is based on either complete word or string comparison. As an example, consider the terms "airline information" and "flight airline info", which after concatenating and removing white spaces become "airlineinformation" and "flightairlineinfo", respectively. The longest common substring is "airlineinfo", and the similarity of the two terms is length(airlineinfo)/length(airlineinformation) = 11/18 = 61%.
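The string-comparison example above can be sketched in a few lines of Python. This is a minimal illustration, not OntoBuilder's actual implementation: the function names are hypothetical, lowercasing is an added assumption, and dividing by the length of the longer term is an assumption that happens to agree with the 11/18 figure in the example.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def term_similarity(term1: str, term2: str) -> float:
    """Hypothetical term matcher: shared substring length over longer term length."""
    # Preprocessing assumed here: lowercase and remove all white space.
    a = "".join(term1.lower().split())
    b = "".join(term2.lower().split())
    if not a or not b:
        return 0.0
    common = longest_common_substring(a, b)
    # Assumption: normalize by the longer of the two preprocessed terms.
    return len(common) / max(len(a), len(b))


if __name__ == "__main__":
    print(term_similarity("airline information", "flight airline info"))
```

For the two terms in the example, the shared substring is "airlineinfo" and the score is 11/18.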

Value: Value matching utilizes domain constraints (e.g., drop lists, check boxes, and radio buttons). It becomes valuable when comparing two terms that do not exactly match through their labels. For example, consider attributes "Dropoff Date" and "Return Date". These two terms have associated value sets {(Select),1,2,…,31} and {(Day),1,2,…,31}, respectively, and thus their content-based similarity is 31/33 = 94%, which improves significantly over their term similarity (length(Date)/length(DropoffDate) = 4/11 = 36%).
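The 31/33 figure in the value example is consistent with dividing the number of shared values by the number of distinct values overall (intersection over union). The sketch below illustrates that reading; the function name is hypothetical and this is an assumption about the measure, not a description of OntoBuilder's code.

```python
def value_similarity(values1, values2) -> float:
    """Hypothetical value matcher: shared values over all distinct values."""
    set1, set2 = set(values1), set(values2)
    union = set1 | set2
    if not union:
        return 0.0
    return len(set1 & set2) / len(union)


if __name__ == "__main__":
    # Drop-list contents from the example: a placeholder entry plus days 1..31.
    dropoff_date = ["(Select)"] + [str(day) for day in range(1, 32)]
    return_date = ["(Day)"] + [str(day) for day in range(1, 32)]
    print(value_similarity(dropoff_date, return_date))
```

The two drop lists share 31 day values and differ only in their placeholder entries, giving 33 distinct values in total.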

Composition: A composite term is composed of other terms (either atomic or composite). Composition can be translated into a hierarchy. This schema matcher assigns similarity to terms, based on the similarity of their neighbors. The Cupid matcher [8], for example, is based on term composition.

Precedence: The order in which data are provided in an interactive process is important. In particular, data given at an earlier stage may restrict the options for a later entry. For example, a hotel chain site may determine which room types are available using the information given regarding the check-in location and time. Therefore, once those entries are filled in, the information is sent back to the server and the next form is brought up. Such precedence relationships can usually be identified by the activation of a script, such as the one associated with a SUBMIT button. Precedence relationships can be translated into a precedence graph.

Description of built-in First-line matchers:

Term Match:

Value Match:

Graph Match:

Precedence Match:

similarityFlooding Match:

Term and Value: A weighted combination of the Term and Value matchers. Here, the input to the matcher involves similarity matrices.

Combined: A weighted combination of the Term, Value, Composition, and Precedence matchers.
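A weighted combination of matchers can be pictured as a cell-by-cell weighted average of their similarity matrices. The sketch below is illustrative only: the function name, the sample scores, and the equal weights are assumptions, and the weights OntoBuilder actually uses are not stated here.

```python
def combine_matrices(matrices, weights):
    """Hypothetical combiner: weighted average of equally shaped similarity matrices."""
    if not matrices or len(matrices) != len(weights):
        raise ValueError("need one weight per similarity matrix")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    rows, cols = len(matrices[0]), len(matrices[0][0])
    for matrix in matrices:
        if len(matrix) != rows or any(len(row) != cols for row in matrix):
            raise ValueError("all matrices must have the same shape")
    # Each cell (i, j) holds the similarity of attribute i in one schema
    # to attribute j in the other; combine cell by cell.
    return [
        [
            sum(w * m[i][j] for m, w in zip(matrices, weights)) / total
            for j in range(cols)
        ]
        for i in range(rows)
    ]


if __name__ == "__main__":
    # Illustrative scores for one attribute against two candidates.
    term_scores = [[0.36, 0.10]]
    value_scores = [[0.94, 0.00]]
    print(combine_matrices([term_scores, value_scores], [0.5, 0.5]))
```

Dividing by the sum of the weights keeps the combined scores in the same range as the inputs even when the weights do not add up to 1.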

Description of built-in Second-line matchers:

Max Weighted Bipartite Graph:

Stable Marriage: Stable Marriage Wiki

Dominants:

Intersection:

Union:

References

[1] C. Batini, M. Lenzerini, S. Navathe, A comparative analysis of methodologies for database schema integration, ACM Computing Surveys 18 (4) (1986) 323–364.

[2] M. Lenzerini, Data integration: a theoretical perspective, in: Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), 2002, pp. 233–246.

[3] P. Bernstein, S. Melnik, Meta data management, in: Proceedings of the IEEE CS International Conference on Data Engineering, IEEE Computer Society, Boston, MA, USA, 2004.

[4] A. Sheth, J. Larson, Federated database systems for managing distributed, heterogeneous, and autonomous databases, ACM Computing Surveys 22 (3) (1990) 183–236.

[5] E. Rahm, P. Bernstein, A survey of approaches to automatic schema matching, VLDB Journal 10 (4) (2001) 334–350.

[6] P. Shvaiko, J. Euzenat, A survey of schema-based matching approaches, Journal of Data Semantics 4 (2005) 146–171.

[7] H. Do, E. Rahm, COMA—a system for flexible combination of schema matching approaches, in: Proceedings of the International Conference on Very Large Data Bases (VLDB), 2002, pp. 610–621.

[8] J. Madhavan, P. Bernstein, E. Rahm, Generic schema matching with Cupid, in: Proceedings of the International Conference on Very Large Data Bases (VLDB), Rome, Italy, 2001, pp. 49–58.

[9] A. Gal, G. Modica, H. Jamil, A. Eyal, Automatic ontology matching using application semantics, AI Magazine 26 (1) (2005) 21–32.

[10] J. Berlin, A. Motro, Autoplex: automated discovery of content for virtual databases, in: C. Batini, F. Giunchiglia, P. Giorgini (Eds.), Cooperative Information Systems, 9th International Conference, CoopIS 2001, September 5–7, 2001, Proceedings, Lecture Notes in Computer Science, vol. 2172, Springer, Trento, Italy, 2001, pp. 108–122.

[11] S. Melnik, E. Rahm, P. Bernstein, Rondo: a programming platform for generic model management, in: Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD), ACM Press, San Diego, CA, 2003, pp. 193–204.

[12] R. Miller, M. Hernandez, L. Haas, L.-L. Yan, C. Ho, R. Fagin, L. Popa, The Clio project: managing heterogeneity, SIGMOD Record 30 (1) (2001) 78–83.

[13] A. Doan, J. Madhavan, P. Domingos, A. Halevy, Learning to map between ontologies on the semantic web, in: Proceedings of the 11th International Conference on World Wide Web, ACM Press, Honolulu, Hawaii, USA, 2002, pp. 662–673.

[14] S. Bergamaschi, S. Castano, M. Vincini, D. Beneventano, Semantic integration of heterogeneous information sources, Data & Knowledge Engineering 36 (3) (2001).

[15] S. Castano, V.D. Antonellis, S.D.C. di Vimercati, Global viewing of heterogeneous data sources, IEEE Transactions on Knowledge and Data Engineering 13 (2) (2001) 277–297.

[16] K. Saleem, Z. Bellahsene, E. Hunt, Performance oriented schema matching, in: 18th International Conference on Database and Expert Systems Applications (DEXA 2007), Springer, Regensburg, Germany, 2007, pp. 844–853.
