Journal of International Information Management


Heterogeneity and interoperability of Web data sources are the key current issues in Web information extraction and integration. The warehouse approach and the virtual approach are the two common ways to integrate heterogeneous Web data sources. However, few analytic or cost models have been developed to measure and assess the efficiency and effectiveness of either approach or of a combination of the two. As a result, no contingency model is available to help a search engine select and mix the warehouse and virtual methods. In this study, we present a genetic-algorithm-assisted hybrid approach that helps the search engine evaluate cost and performance factors. We apply the genetic algorithm technique to formulate a cost optimization model and to compute and compare the costs of extraction and integration. The cost model is built from property data collected and compiled from the query analysis and path expressions of the Web data sources involved. Six property analyses are conducted, and six evolution steps are defined to formulate the genetic algorithm for optimization. Further, we conduct a preliminary experiment using 15 local and global Web bookstores to install and test the method. Our experimental results show that cost optimization can be achieved with the genetic algorithm and factor analysis.
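To make the idea of a genetic-algorithm-assisted hybrid plan concrete, the following is a minimal sketch, not the cost model or the six evolution steps of this study. It assumes a hypothetical per-source cost table (`warehouse_cost`, `virtual_cost`) and encodes a plan as one bit per Web data source (1 for warehouse, 0 for virtual). The function names, the truncation selection, the one-point crossover, and all parameter values are illustrative assumptions.

```python
import random


def total_cost(plan, warehouse_cost, virtual_cost):
    """Sum per-source costs; plan[i] is 1 for warehouse, 0 for virtual."""
    return sum(w if g else v
               for g, w, v in zip(plan, warehouse_cost, virtual_cost))


def optimize_plan(warehouse_cost, virtual_cost, pop_size=30, generations=60,
                  mutation_rate=0.05, seed=0):
    """Search for a low-cost warehouse/virtual mix (hypothetical sketch).

    Assumes at least two sources and pop_size >= 4.
    """
    rng = random.Random(seed)
    n = len(warehouse_cost)

    def cost(plan):
        return total_cost(plan, warehouse_cost, virtual_cost)

    # Seed the population with the two pure plans so the result is never
    # worse than an all-virtual or all-warehouse integration.
    population = [[0] * n, [1] * n]
    population += [[rng.randint(0, 1) for _ in range(n)]
                   for _ in range(pop_size - 2)]

    for _ in range(generations):
        # Truncation selection: the cheaper half survives unchanged.
        population.sort(key=cost)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n - 1)           # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation switches a source between the two methods.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = parents + children

    return min(population, key=cost)
```

Calling `optimize_plan(w, v)` with two equal-length cost lists returns one bit per source. Because the costs in this sketch are independent per source, the best plan could also be found by picking the cheaper method for each source directly; the genetic search becomes useful only when the cost model includes interactions between sources, which this toy cost function does not attempt to represent.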