Data Lake
A storage system for raw data in any format, where structure is applied at query time, not at ingestion.
A data lake is a storage system that accepts data in any format — structured tables, JSON logs, images, PDFs, raw sensor readings — and stores it exactly as it arrived. Unlike a data warehouse that validates and structures data before storing it, a data lake stores first and asks questions later.
This approach exists because not every piece of data has an obvious use at the time it is generated. Server logs, user behavior streams, IoT sensor readings — these might become valuable months or years later when a data scientist discovers a pattern. A data lake preserves that raw material, ensuring nothing is lost or pre-filtered before you know what questions to ask.
What Actually Happens Inside a Data Lake
When data enters a data lake, nothing validates it. Nothing transforms it. The raw bytes are stored exactly as they arrived. Structure is applied later, only when you query. Let us trace what this means in practice.
Data arrives from multiple sources:
JSON logs from web server → Stored as-is (raw JSON files)
CSV exports from vendor → Stored as-is (raw CSV files)
Images from security cameras → Stored as-is (raw image files)
PDFs from contracts → Stored as-is (raw PDF files)
API responses from partners → Stored as-is (raw JSON/XML)
No schema required. No type checking. No validation. The data lake does not care what format the data is in — it just stores the bytes. This is fundamentally different from a data warehouse, where data must pass through ETL validation before it can enter.
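Ingestion really is this simple. As a minimal sketch (assuming a local filesystem standing in for object storage; `ingest_raw` and the `raw/` layout are illustrative names, not a standard API), a lake-style write just persists the incoming bytes untouched:

```python
import pathlib


def ingest_raw(lake_root: str, source: str, filename: str, payload: bytes) -> pathlib.Path:
    """Store incoming bytes exactly as received: no parsing, no validation,
    no schema. Any format (JSON, CSV, image, PDF) passes through unchanged."""
    dest = pathlib.Path(lake_root) / "raw" / source / filename
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(payload)  # raw bytes in, raw bytes stored
    return dest
```

Note what is absent: there is no type check, no schema lookup, and no rejection path. A warehouse loader would need all three before the write could succeed.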
At query time, the engine figures out structure on the fly:
Query engine reads raw files
→ Parses structure on the fly
→ Handles missing fields gracefully
→ Infers data types from content
→ Returns whatever it can extract
Example: "SELECT user_id, action FROM web_logs WHERE date = '2024-01-15'"
→ Engine opens JSON files for that date
→ Parses each JSON object
→ Extracts user_id and action fields
→ Skips records where fields are missing
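The query-time steps above can be sketched in a few lines of Python. This is a toy stand-in for a real query engine (the `web_logs/<date>/` layout and line-delimited JSON files are assumptions for illustration), but it shows the essential schema-on-read behavior: parse at read time, skip what does not fit:

```python
import json
from pathlib import Path


def query_web_logs(lake_root: str, date: str) -> list[tuple]:
    """Schema-on-read: structure is discovered while reading.
    Malformed lines and records missing the requested fields are
    skipped gracefully instead of failing the whole query."""
    results = []
    for path in Path(lake_root, "web_logs", date).glob("*.json"):
        for line in path.read_text().splitlines():
            try:
                record = json.loads(line)  # parse structure on the fly
            except json.JSONDecodeError:
                continue  # bad bytes were stored too; skip them at read time
            if "user_id" in record and "action" in record:
                results.append((record["user_id"], record["action"]))
    return results
```

The cost is visible here: every query re-parses every file it touches. A warehouse pays that parsing cost once, at write time.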
A data lake stores first, asks questions later. Structure is not enforced at write time — it is applied at read time. This is called schema-on-read. The cost of this flexibility is slower queries, because the engine must figure out the data structure every time you ask a question.
The Key Difference: Schema-on-Read
Data lakes apply structure when reading, not when writing. Let us trace what this means compared to a warehouse:
Schema-on-Write (Data Warehouse):
Raw data → Validate → Transform → Store structured data
Slow to write, fast to read
Schema-on-Read (Data Lake):
Raw data → Store immediately → Apply structure at query time
Fast to write, slower to read
This means queries are slower because the engine figures out structure on every read. But storage is fast and flexible — you never reject data, and you never need to know in advance what questions you will ask. The data is there, waiting for whatever question someone comes up with later.
Data Warehouse vs Data Lake
| Aspect | Data Warehouse | Data Lake |
|---|---|---|
| Schema | Schema-on-Write (validate first) | Schema-on-Read (structure applied at query time) |
| Storage speed | Slower (validation and transformation required) | Fast (dump raw data immediately) |
| Query speed | Fast (data is pre-structured) | Slower (structure applied at read time) |
| Data quality | Enforced at entry — bad data rejected | Applied at query — bad data stored alongside good |
| Flexibility | Low (rigid schema, changes require migration) | High (any format, any structure) |
| Cost | Higher (optimized storage, compute for ETL) | Lower (bulk storage, minimal compute on write) |
| Best for | Known, recurring questions | Unknown, exploratory questions |
Many organizations use both: raw data lands in the data lake for safekeeping, then curated and validated subsets flow into the data warehouse for daily business reporting. This way, analysts get fast queries on clean data, and data scientists get access to the full raw dataset for exploration.
The Data Swamp Problem
A data lake without governance becomes a data swamp — a massive collection of files that nobody can find, understand, or trust. This happens when teams dump data without documentation, naming conventions, or access controls.
Data Swamp:
/data/export_2023_final_v2_FIXED.csv ← What is this?
/data/john_backup_temp.json ← Who is John? Why temp?
/data/old/new/data.parquet ← Old or new?
Data Lake (governed):
/raw/crm/customers/2024/01/15/full.json ← Source, date, type clear
/raw/web/events/2024/01/15/hourly/ ← Partitioned, discoverable
/curated/sales/monthly_summary.parquet ← Cleaned, documented
The difference is organization: clear naming conventions, consistent folder structures, metadata catalogs, and access controls. A data lake is not a dumping ground. It is a strategic asset that requires governance to remain useful.
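One cheap piece of governance is generating paths from a single convention instead of letting each team invent names. A minimal sketch (the `/raw/<source>/<dataset>/YYYY/MM/DD/` layout mirrors the example above; `raw_path` is an illustrative helper, not a standard tool):

```python
from datetime import date


def raw_path(source: str, dataset: str, day: date, filename: str) -> str:
    """Build a governed lake path: /raw/<source>/<dataset>/YYYY/MM/DD/<file>.
    Source, date, and content are always recoverable from the path itself."""
    return f"/raw/{source}/{dataset}/{day:%Y/%m/%d}/{filename}"
```

Because every writer calls the same helper, paths stay discoverable and partitioned by date, which is exactly what the swamp examples lack.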
When to Use a Data Lake
Use a data lake when:
- You do not know what questions you will ask yet — the data needs to be preserved for future discovery
- Data structure varies or evolves frequently across sources
- You need to store raw data for compliance, audit trails, or future reprocessing
- Data comes from many sources with different formats (JSON, CSV, images, logs)
- Data scientists need access to unfiltered, raw datasets for machine learning and exploration
Used well, a data lake preserves optionality — the ability to ask questions you have not thought of yet. Without governance, it becomes a swamp.
Mental Model: The Storage Unit
Think of a data lake as a well-labeled storage unit. You can put anything in — boxes, furniture, equipment, documents. Finding something specific takes more time than a filing cabinet because you have to look through containers. But you never have to decide upfront where things go, you never have to throw anything away because it does not fit a category, and you always have the original item in its original condition when you need it later.
The key word is “well-labeled.” An unlabeled storage unit where boxes are stacked randomly is a nightmare. A storage unit where every box has a date, a source, and a content description is a valuable archive. The same is true for data lakes.
Flexibility and preservation in exchange for query complexity. You can store anything without planning, but finding and understanding that data requires more effort at query time. Governance is not optional — it is what separates a data lake from a data swamp.