Conceptually, data integration takes place in three steps:
Extracting the data from the various source systems using adapters.
Transforming the data from one format or structure into another, enriching, merging and/or cleansing it in the process.
Loading the transformed data into a new structure in a database or data warehouse, ready for use by reporting and analysis tools.
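The three steps above can be sketched as a small pipeline. This is a minimal illustration using only the standard library; the source CSV, the column names, and the target SQLite table are illustrative assumptions, not a specific tool's design.

```python
# Minimal Extract -> Transform -> Load sketch (illustrative assumptions only).
import csv
import io
import sqlite3

def extract(csv_text):
    """Extract: read rows from a source system (here, a CSV export)."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: cleanse and convert each row before loading."""
    out = []
    for row in rows:
        out.append({
            "name": row["name"].strip().title(),  # cleanse whitespace and casing
            "revenue": float(row["revenue"]),     # convert text to a number
        })
    return out

def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, revenue REAL)")
    conn.executemany(
        "INSERT INTO customers (name, revenue) VALUES (:name, :revenue)", rows
    )

source = "name,revenue\n alice ,100.5\n BOB ,200.0\n"
conn = sqlite3.connect(":memory:")
load(transform(extract(source)), conn)
print(conn.execute("SELECT name, revenue FROM customers").fetchall())
# → [('Alice', 100.5), ('Bob', 200.0)]
```

In a real pipeline each step would typically run against an adapter or connector rather than an in-memory string, but the shape of the flow is the same.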
Depending on where the transformations take place, we speak of ETL (extract, transform, load) or ELT (extract, load, transform). The former uses a separate transformation engine, while the latter uses the power of the target database to perform the transformations.
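The ELT variant can be sketched as follows: the raw data is loaded into a staging table first, and the database engine itself then performs the transformation in SQL. The table and column names are illustrative assumptions.

```python
# Minimal ELT sketch: load raw data first, transform inside the database.
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: land the raw, untransformed data in a staging table.
conn.execute("CREATE TABLE staging_orders (amount TEXT, currency TEXT)")
conn.executemany(
    "INSERT INTO staging_orders VALUES (?, ?)",
    [("100.50", "EUR"), ("75.25", "EUR")],
)

# Transform: use the database engine's own power (plain SQL here) to cast
# the text amounts and write the result into the target table.
conn.execute("""
    CREATE TABLE orders AS
    SELECT CAST(amount AS REAL) AS amount, currency
    FROM staging_orders
""")

print(conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0])
# → 175.75
```

The design choice is where the compute happens: in ETL the transformation engine does this work before loading; in ELT the CAST and any aggregation run inside the database after loading.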
Data integration scripts can be written by hand, but this quickly results in a ‘spaghetti’ of data connections that becomes impossible to maintain as the number of data sources grows.
By using specific data integration software, data can be ingested from the various sources via connectors. Mappings are then used to model metadata (data about data). These models are subsequently executed by the transformation engine, which loads the integrated data into the target system.
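A mapping of the kind described above can be sketched as declarative metadata that a transformation engine executes. The field names, rules, and structure below are illustrative assumptions, not the format of any particular integration tool.

```python
# A sketch of a declarative mapping: metadata describing how source fields
# map onto target fields, plus a tiny "engine" that executes it.
mapping = {
    "source": "crm.contacts",
    "target": "warehouse.customers",
    "fields": [
        {"from": "fname", "to": "first_name", "rule": "strip"},
        {"from": "email", "to": "email",      "rule": "lower"},
    ],
}

# Transformation rules the engine knows how to apply.
RULES = {"strip": str.strip, "lower": str.lower}

def apply_mapping(row, mapping):
    """Execute the mapping: produce a target row from a source row."""
    return {f["to"]: RULES[f["rule"]](row[f["from"]]) for f in mapping["fields"]}

print(apply_mapping({"fname": " Ada ", "email": "Ada@Example.COM"}, mapping))
# → {'first_name': 'Ada', 'email': 'ada@example.com'}
```

The point of modelling the mapping as data rather than code is that the same engine can execute many mappings, and a new source only requires new metadata.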
The advantage is that once data sources from one or more systems have been combined in an ELT process, adding a data source from another system becomes much easier.
Data silos are heterogeneous data sources that store data in isolated locations and legacy systems. Data integration supports the cross-functional use of data, bringing proprietary, legacy data into new systems that any team member within the organisation can easily access.
“Create once, deliver to many”. By creating a central data source, data users within the company will all be able to access the same information, which can reduce the number of questions asked, increase the speed of data access, and limit the possibility of erroneous replicated data. Authors can continue to use the systems of their choice, while end users access what they need from a central location.
A common problem with using various systems to collect data, or having many people collect it, is ontology issues: multiple types of data that describe the same thing but are organised differently. A typical example is the way dates are stored (“DD/MM/YYYY”, “MM/DD/YYYY”, “Month Day, Year”, etc.). By removing these variations and creating a structured data warehouse, data becomes easier to find and patterns easier to analyse.
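Resolving the date example above can be sketched as normalising every known source format to a single canonical one. The list of formats is an illustrative assumption about what the source systems contain.

```python
# A sketch of resolving one common ontology issue: the same date stored in
# different formats across source systems.
from datetime import datetime

# Candidate formats seen in the source systems, tried in order. Note that
# "%d/%m/%Y" and "%m/%d/%Y" are ambiguous for dates like 01/02/2024, so
# the order of this list encodes an assumption about the dominant source.
KNOWN_FORMATS = ["%d/%m/%Y", "%m/%d/%Y", "%B %d, %Y"]

def normalise_date(text):
    """Parse a date in any known source format; return ISO 8601 (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {text!r}")

print(normalise_date("31/01/2024"))       # → 2024-01-31
print(normalise_date("January 31, 2024")) # → 2024-01-31
```

Once every source date passes through such a normalisation step on its way into the warehouse, queries and analyses no longer need to know which system a record came from.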