A data lake is a centralised repository that allows storage of structured and unstructured data at any scale. Data can be stored as-is, without having to first structure it.
A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video). It can be established "on premises" (within an organisation's data centres) or "in the cloud".
Being able to harness more data from more sources in less time, and to let users collaborate and analyse that data in different ways, leads to better and faster decision making.
A data warehouse is a database optimised to analyse relational data coming from transactional systems and line-of-business applications. The data structure and schema are defined in advance to optimise for fast operational reporting and analysis.
A data lake is different because it stores both relational data from line-of-business applications and non-relational data from mobile apps, IoT devices, and social media. The structure of the data, or schema, is not defined when the data is captured. This means you can store all of your data without careful up-front design and without knowing which questions you might need to answer in the future.
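The idea that a schema is applied only when data is read, not when it is captured, can be illustrated with a minimal sketch. The record contents, the field names, and the helper `read_with_schema` are all hypothetical and chosen only for illustration; a real data lake would read from object storage or a distributed file system instead of an in-memory list.

```python
import json

# Hypothetical raw records, kept exactly as they arrived from different sources.
# No common structure was enforced when they were stored.
RAW_EVENTS = [
    '{"source": "mobile_app", "user": "u1", "screen": "home", "ms": 120}',
    '{"source": "iot", "device": "sensor-7", "temperature_c": 21.5}',
    '{"source": "orders_db", "order_id": 42, "user": "u1", "total": 19.99}',
]


def read_with_schema(raw_lines, fields):
    """Apply a schema at read time.

    Keeps only the records that contain every requested field, and
    projects each kept record down to those fields.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if all(name in record for name in fields):
            rows.append({name: record[name] for name in fields})
    return rows


# Two different questions, asked after the data was stored,
# each bringing its own schema to the same raw records.
user_activity = read_with_schema(RAW_EVENTS, ["source", "user"])
temperatures = read_with_schema(RAW_EVENTS, ["device", "temperature_c"])
```

In a warehouse the equivalent of `fields` would be fixed before any record is loaded; here it is chosen per question, which is why records of very different shapes can sit side by side.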
As organisations with data warehouses see the benefits of data lakes, they are evolving their warehouses to include data lakes and to enable diverse query capabilities, data science use cases, and advanced capabilities for discovering new information models.
The main challenge with a data lake architecture is that raw data is stored with no oversight of the contents. Because data scientists decide per use case, at query time, which data they need, a data lake must have defined mechanisms to catalog and secure data.
Without these elements, data cannot be found or trusted, resulting in a "data swamp". Meeting the needs of wider audiences requires data lakes to have governance, semantic consistency, and access controls.
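As a rough illustration of what cataloguing and access control mean in practice, the sketch below keeps a small in-memory catalog. The names `CatalogEntry` and `Catalog`, the storage paths, and the role names are assumptions made for this example and do not correspond to any particular product; production systems would typically rely on a dedicated catalog and policy service.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one dataset stored in the lake (hypothetical shape)."""
    path: str
    description: str
    owner: str
    allowed_roles: set = field(default_factory=set)


class Catalog:
    """A toy catalog: makes datasets findable and records who may read them."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Index each dataset by its storage path.
        self._entries[entry.path] = entry

    def search(self, keyword):
        # Case-insensitive match against the dataset descriptions.
        keyword = keyword.lower()
        return [
            entry.path
            for entry in self._entries.values()
            if keyword in entry.description.lower()
        ]

    def can_read(self, path, role):
        # Uncatalogued data is treated as unreadable: it cannot be trusted.
        entry = self._entries.get(path)
        return entry is not None and role in entry.allowed_roles


catalog = Catalog()
catalog.register(CatalogEntry(
    path="raw/iot/readings.json",
    description="Temperature readings from IoT sensors",
    owner="platform-team",
    allowed_roles={"data_scientist", "engineer"},
))
catalog.register(CatalogEntry(
    path="raw/crm/customers.csv",
    description="Customer records exported from the CRM",
    owner="sales-ops",
    allowed_roles={"analyst"},
))

found = catalog.search("temperature")
allowed = catalog.can_read("raw/iot/readings.json", "data_scientist")
denied = catalog.can_read("raw/crm/customers.csv", "data_scientist")
```

Denying access to anything that is not registered is one possible design choice; it reflects the point that data nobody has described or assigned an owner to is exactly what turns a lake into a swamp.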