Data Lake

A storage system for raw data in any format, where structure is applied at query time, not at ingestion.

Kunwar "AKA" AJ · Sharing what I have learned
Jan 4, 2026 · 6 min read · Data Engineering

A data lake is a storage system that accepts data in any format — structured tables, JSON logs, images, PDFs, raw sensor readings — and stores it exactly as it arrived. Unlike a data warehouse that validates and structures data before storing it, a data lake stores first and asks questions later.

This approach exists because not every piece of data has an obvious use at the time it is generated. Server logs, user behavior streams, IoT sensor readings — these might become valuable months or years later when a data scientist discovers a pattern. A data lake preserves that raw material, ensuring nothing is lost or pre-filtered before you know what questions to ask.

What Actually Happens Inside a Data Lake

When data enters a data lake, nothing validates it. Nothing transforms it. The raw bytes are stored exactly as they arrived. Structure is applied later, only when you query. Let us trace what this means in practice.

Data arrives from multiple sources:

data-ingestion.txt
JSON logs from web server    → Stored as-is (raw JSON files)
CSV exports from vendor      → Stored as-is (raw CSV files)
Images from security cameras → Stored as-is (raw image files)
PDFs from contracts          → Stored as-is (raw PDF files)
API responses from partners  → Stored as-is (raw JSON/XML)

No schema required. No type checking. No validation. The data lake does not care what format the data is in — it just stores the bytes. This is fundamentally different from a data warehouse, where data must pass through ETL validation before it can enter.
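In code, "store first" really is this simple. A minimal sketch in Python (the landing directory and file names are illustrative, not from any particular lake product): the same function accepts JSON, CSV, or binary payloads, because it never looks inside them.

```python
from pathlib import Path

LANDING = Path("/tmp/lake/landing")  # illustrative landing zone

def ingest(payload: bytes, filename: str) -> Path:
    """Accept any bytes and store them exactly as they arrived.
    No schema check, no type check, no parsing."""
    LANDING.mkdir(parents=True, exist_ok=True)
    target = LANDING / filename
    target.write_bytes(payload)  # bytes in, bytes stored
    return target

# JSON, CSV, and binary payloads all take the same path in:
ingest(b'{"user_id": 1, "action": "click"}', "web_events.json")
ingest(b"id,name\n1,Alice\n", "vendor_export.csv")
ingest(b"\x89PNG\r\n...", "camera_frame.png")
```

Notice there is no branch per format: the lake's write path is format-blind by design.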

At query time, the engine figures out structure on the fly:

query-execution.txt
Query engine reads raw files
  → Parses structure on the fly
  → Handles missing fields gracefully
  → Infers data types from content
  → Returns whatever it can extract

Example: "SELECT user_id, action FROM web_logs WHERE date = '2024-01-15'"
  → Engine opens JSON files for that date
  → Parses each JSON object
  → Extracts user_id and action fields
  → Skips records where fields are missing

Key Insight

A data lake stores first, asks questions later. Structure is not enforced at write time — it is applied at read time. This is called schema-on-read. The cost of this flexibility is slower queries, because the engine must figure out the data structure every time you ask a question.
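The query trace above can be sketched in a few lines of Python. There is no real query engine here, just the schema-on-read parsing logic: structure is discovered while reading, malformed lines are skipped, and records missing a field are dropped gracefully. The folder layout and field names follow the earlier example and are assumptions.

```python
import json
from pathlib import Path

def query_web_logs(lake_dir: str, wanted_date: str):
    """SELECT user_id, action FROM web_logs WHERE date = wanted_date,
    done schema-on-read: the raw JSON is parsed at query time."""
    results = []
    for path in Path(lake_dir).glob(f"web_logs/{wanted_date}/*.json"):
        for line in path.read_text().splitlines():
            try:
                record = json.loads(line)   # structure inferred now, not at write time
            except json.JSONDecodeError:
                continue                    # malformed line: skip, do not fail
            if "user_id" in record and "action" in record:
                results.append((record["user_id"], record["action"]))
            # records missing either field are skipped gracefully
    return results
```

Every call re-does this parsing work, which is exactly why schema-on-read queries are slower than reading a pre-structured table.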

The Key Difference: Schema-on-Read

Data lakes apply structure when reading, not when writing. Let us trace what this means compared to a warehouse:

schema-on-read.txt
Schema-on-Write (Data Warehouse):
  Raw data → Validate → Transform → Store structured data
  Slow to write, fast to read

Schema-on-Read (Data Lake):
  Raw data → Store immediately → Apply structure at query time
  Fast to write, slower to read

This means queries are slower because the engine figures out structure on every read. But storage is fast and flexible — you never reject data, and you never need to know in advance what questions you will ask. The data is there, waiting for whatever question someone comes up with later.
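The timing difference can be made concrete with two toy insert functions (the schema and error handling are illustrative): the warehouse path validates now and rejects bad data at the door, while the lake path accepts anything and defers all judgment to read time.

```python
SCHEMA = {"user_id": int, "action": str}  # illustrative warehouse schema

def warehouse_insert(table: list, record: dict) -> None:
    """Schema-on-write: validate now; bad data never enters."""
    for field, ftype in SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"rejected: {field} missing or wrong type")
    table.append(record)

def lake_insert(files: list, raw: str) -> None:
    """Schema-on-read: store the raw string now, parse it later (or never)."""
    files.append(raw)  # no validation at write time

table, files = [], []
warehouse_insert(table, {"user_id": 1, "action": "click"})  # accepted
lake_insert(files, '{"user_id": "oops"}')                   # accepted as-is
try:
    warehouse_insert(table, {"user_id": "oops"})            # rejected at write time
except ValueError:
    pass
```

The work has to happen somewhere; the two models only disagree about when.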

Data Warehouse vs Data Lake

Aspect        | Data Warehouse                             | Data Lake
--------------|--------------------------------------------|--------------------------------------------
Schema        | Schema-on-Write (validate first)           | Schema-on-Read (validate at query time)
Storage speed | Slower (validation and transformation)     | Fast (dump raw data immediately)
Query speed   | Fast (data is pre-structured)              | Slower (structure applied at read time)
Data quality  | Enforced at entry; bad data rejected       | Applied at query; bad data stored with good
Flexibility   | Low (rigid schema; changes need migration) | High (any format, any structure)
Cost          | Higher (optimized storage, ETL compute)    | Lower (bulk storage, minimal write compute)
Best for      | Known, recurring questions                 | Unknown, exploratory questions

Many organizations use both: raw data lands in the data lake for safekeeping, then curated and validated subsets flow into the data warehouse for daily business reporting. This way, analysts get fast queries on clean data, and data scientists get access to the full raw dataset for exploration.
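That two-tier flow can be sketched as a small curation job: read everything from the raw zone, keep only records that pass validation, and hand the clean subset to the warehouse. The function name and the validation rule (two required fields) are illustrative assumptions.

```python
import json

def curate(raw_lines, required=("user_id", "action")):
    """Promote raw JSON lines from the lake into warehouse-ready rows.
    Records that fail stay behind in the lake; only validated rows move on."""
    clean, rejected = [], []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        if all(field in record for field in required):
            clean.append({f: record[f] for f in required})  # uniform shape for the warehouse
        else:
            rejected.append(line)
    return clean, rejected

raw = ['{"user_id": 1, "action": "click", "extra": true}', '{"action": "view"}', "garbage"]
clean, rejected = curate(raw)
# clean    -> validated, uniform rows bound for the warehouse
# rejected -> still safe in the lake for later inspection
```

Nothing is lost in this step: the rejects remain in the lake, so the curation rules can be revised and re-run later.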

The Data Swamp Problem

A data lake without governance becomes a data swamp — a massive collection of files that nobody can find, understand, or trust. This happens when teams dump data without documentation, naming conventions, or access controls.

data-swamp-vs-data-lake.txt
Data Swamp:
  /data/export_2023_final_v2_FIXED.csv     ← What is this?
  /data/john_backup_temp.json              ← Who is John? Why temp?
  /data/old/new/data.parquet               ← Old or new?

Data Lake (governed):
  /raw/crm/customers/2024/01/15/full.json  ← Source, date, type clear
  /raw/web/events/2024/01/15/hourly/       ← Partitioned, discoverable
  /curated/sales/monthly_summary.parquet   ← Cleaned, documented

The difference is organization: clear naming conventions, consistent folder structures, metadata catalogs, and access controls. Governance is what keeps the files findable and trustworthy as the lake grows.
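A naming convention is easiest to enforce when one function builds every path, so nothing can land at an ad-hoc location. A sketch (the zone names and layout mirror the governed example above; they are a convention, not a standard):

```python
from datetime import date

ZONES = {"raw", "curated"}  # the only zones this illustrative lake allows

def lake_path(zone: str, source: str, dataset: str, day: date, filename: str) -> str:
    """Build a governed lake path: /<zone>/<source>/<dataset>/YYYY/MM/DD/<file>.
    Rejecting unknown zones stops files from landing in ad-hoc locations."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"/{zone}/{source}/{dataset}/{day:%Y/%m/%d}/{filename}"

print(lake_path("raw", "crm", "customers", date(2024, 1, 15), "full.json"))
# -> /raw/crm/customers/2024/01/15/full.json
```

If every ingestion job routes through a helper like this, `export_2023_final_v2_FIXED.csv` simply has nowhere to go.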

When to Use a Data Lake

Use a data lake when:

  • You do not know what questions you will ask yet — the data needs to be preserved for future discovery
  • Data structure varies or evolves frequently across sources
  • You need to store raw data for compliance, audit trails, or future reprocessing
  • Data comes from many sources with different formats (JSON, CSV, images, logs)
  • Data scientists need access to unfiltered, raw datasets for machine learning and exploration

A data lake is not a dumping ground. It is a strategic asset that preserves optionality — the ability to ask questions you have not thought of yet. But without governance, it becomes a swamp.

Mental Model: The Storage Unit

Think of a data lake like a well-labeled storage unit. You can put anything in — boxes, furniture, equipment, documents. Finding something specific takes more time than a filing cabinet because you have to look through containers. But you never have to decide upfront where things go, you never have to throw anything away because it does not fit a category, and you always have the original item in its original condition when you need it later.

The key word is “well-labeled.” An unlabeled storage unit where boxes are stacked randomly is a nightmare. A storage unit where every box has a date, a source, and a content description is a valuable archive. The same is true for data lakes.

The Trade-off

Flexibility and preservation in exchange for query complexity. You can store anything without planning, but finding and understanding that data requires more effort at query time. Governance is not optional — it is what separates a data lake from a data swamp.