Database Management System
DBMS, SQL, Data Warehousing & Data Mining
Introduction to Data & DBMS
Data, information, databases — and why DBMS exists
Data
Raw, unprocessed facts and figures without context. Examples: 85, 'Ram', 2025. On their own, these numbers and names carry no meaning.
Information
Data that has been processed and organised to give it meaning and context. Example: 'Ram scored 85 marks in the 2025 examination' is information derived from the raw data above.
Database
An organised, structured collection of related data stored together so it can be easily accessed, managed, updated, and retrieved. Example: a college's student records — names, roll numbers, marks, addresses — all stored and interlinked.
DBMS (Database Management System)
Software that creates, maintains, and controls access to a database. It acts as an interface between the database and the users or application programs that need the data. Examples: MySQL, Oracle, Microsoft Access, MongoDB, PostgreSQL.
Advantages of DBMS over Traditional File System
Eliminates Data Redundancy
File systems store the same data in multiple separate files, causing duplication. DBMS centralises data so each fact is stored once and shared.
Ensures Data Consistency
Because each fact is stored only once, a single update in a DBMS is reflected everywhere. File systems require manual synchronisation across files, which leads to contradictory values.
Data Sharing
Multiple users and applications can access the same database concurrently with controlled, role-based permissions.
Data Security
DBMS enforces access control — different users get different read/write permissions on specific tables or even individual fields.
Data Integrity
Constraints (primary key, NOT NULL, CHECK) ensure only valid data is stored. Traditional file systems have no built-in data validation.
Backup & Recovery
DBMS provides automatic backup mechanisms and transaction logs that allow full restoration of data after hardware failure or corruption.
Key DBMS Concepts
| Term | Definition | Example |
|---|---|---|
| Table (Relation) | A 2D structure with rows and columns that stores data about one entity type | Student table with columns: StudentID, Name, GPA, Department |
| Record (Row / Tuple) | A single, complete entry in a table — one instance of the entity | Row: (101, "Ram Sharma", 3.8, "BCA") |
| Field (Column / Attribute) | A single category of information stored for every record | The "GPA" column in the Student table |
| Primary Key | A field (or combination) that uniquely identifies each record; cannot be NULL or duplicate | StudentID — every student has a unique roll number |
| Foreign Key | A field in one table referencing the primary key of another table; creates a link between tables | CourseID in the Enrollment table references the Course table |
| Candidate Key | Any field that could serve as a primary key (must be unique and not null); one is chosen as the primary key | Both StudentID and Email qualify as candidate keys |
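The key and constraint concepts above can be demonstrated in a runnable sketch. This uses Python's built-in sqlite3 module; the Student and Enrollment tables mirror the examples in the table, but the exact column set is illustrative:

```python
# Primary key, foreign key, and CHECK constraints in action,
# using Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.execute("""
    CREATE TABLE Student (
        StudentID  INTEGER PRIMARY KEY,               -- unique, non-null identifier
        Name       TEXT NOT NULL,
        GPA        REAL CHECK (GPA BETWEEN 0 AND 4),  -- integrity constraint
        Department TEXT
    )""")
conn.execute("""
    CREATE TABLE Enrollment (
        StudentID INTEGER REFERENCES Student(StudentID),  -- foreign key link
        CourseID  TEXT
    )""")

conn.execute("INSERT INTO Student VALUES (101, 'Ram Sharma', 3.8, 'BCA')")

# The constraints reject invalid data:
errors = []
for stmt in [
    "INSERT INTO Student VALUES (101, 'Duplicate', 3.0, 'BCA')",  # duplicate PK
    "INSERT INTO Student VALUES (102, 'Bad GPA', 9.9, 'BCA')",    # CHECK fails
    "INSERT INTO Enrollment VALUES (999, 'CS101')",               # no such student
]:
    try:
        conn.execute(stmt)
    except sqlite3.IntegrityError as e:
        errors.append(type(e).__name__)

print(errors)  # three IntegrityErrors: every invalid row was rejected
```

All three invalid inserts fail with an integrity error, which is exactly the "only valid data is stored" guarantee that traditional file systems lack.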
Database Architecture
3-tier schema, data independence, DBA and user types
ANSI/SPARC 3-Level Architecture
The three-schema architecture separates the database into three levels so that changes at one level do not affect the others. This separation is the foundation of data independence.
External Level (View)
What each individual user or application sees. Different users can have different views of the same underlying data. Also called the View Level or User Schema.
A student sees only their own marks; an administrator sees all students across all courses.
Conceptual Level (Logical)
Describes the entire logical structure of the whole database — all tables, fields, relationships, constraints, and security rules. The DBA works primarily at this level.
The complete ER (Entity-Relationship) diagram of the college database.
Internal Level (Physical)
Describes how data is physically stored on disk — file organisation, indexing structures, data compression, and storage allocation. Entirely hidden from users.
A B-tree index on StudentID stored in binary format across disk blocks.
Data Independence
Logical Data Independence
The ability to change the conceptual schema (add a new table, add a column to an existing table) without needing to change the external views that individual users see. Changes to the logical design are shielded from end users.
Physical Data Independence
The ability to change the internal/physical schema (move data to a new storage device, change the indexing method from B-tree to hash) without affecting the conceptual or external schemas above it. Most DBMSs provide this.
Database Administrator (DBA) Role
The DBA has full control over the database system: designing and maintaining the schema, granting and revoking user permissions, overseeing backup and recovery, and tuning performance.
Types of Database Users
| User Type | Description | Example |
|---|---|---|
| End Users (Naive) | Use the database through pre-built application interfaces without any knowledge of SQL | A bank teller using the banking software to check account balances |
| Application Programmers | Write programs that interact with the database using embedded SQL or database APIs | A developer writing a student portal in PHP/Python that runs SQL queries |
| Sophisticated Users | Interact with the database directly using SQL query tools and reporting software | A data analyst writing complex multi-table JOIN queries for monthly reports |
| Database Administrator (DBA) | Full control — manages schema, user permissions, backup, recovery, and performance tuning | The IT DBA at a hospital managing the patient records database |
Database Models
How data is logically structured and related
Hierarchical Model
Organises data in a tree structure with a single root node at the top. Each child record has exactly one parent (one-to-many relationships only). Navigation is always top-down following parent-child paths. Fast for well-defined access patterns but rigid — adding new relationships requires restructuring the entire tree. Used in early IBM IMS systems and still present in the Windows Registry.
Network Model
An extension of the hierarchical model where a child record can have multiple parents, forming a graph (web-like) structure with many-to-many relationships. More flexible than hierarchical but significantly more complex to design, navigate, and maintain. Used in CODASYL-standard databases of the 1970s.
Relational Model
Organises all data into two-dimensional tables called relations. Relationships between tables are expressed through shared key values (foreign keys), not physical pointers. Based on the relational algebra proposed by E.F. Codd at IBM in 1970. This is the dominant database model today. SQL is its standard query language. Examples: MySQL, Oracle, PostgreSQL.
Object-Oriented Model
Stores data as objects — directly mirroring object-oriented programming concepts. Each object encapsulates both attributes (data) and methods (behaviour). Supports inheritance, encapsulation, and polymorphism. Well-suited for complex data types like CAD drawings, multimedia files, and scientific simulations. Examples: db4o, ObjectDB.
Comparison of Database Models
| Model | Structure | Relationships | Example System | Main Advantage | Main Disadvantage |
|---|---|---|---|---|---|
| Hierarchical | Tree | One-to-many (single parent only) | IBM IMS, Windows Registry | Very fast for predefined top-down queries | Rigid; no many-to-many; restructuring is costly |
| Network | Graph | Many-to-many supported | IDMS, CODASYL databases | Flexible — models complex real-world relationships | Complex design, navigation, and maintenance |
| Relational | Tables (Relations) | Any via foreign keys and JOINs | MySQL, Oracle, PostgreSQL | Simple, flexible, powerful; standard SQL language | Performance can degrade with very large multi-table JOINs |
| Object-Oriented | Objects with attributes and methods | Inheritance and associations | db4o, ObjectDB | Natural for complex, multimedia, and CAD data | Less standardised; fewer tools and developer familiarity |
SQL & NoSQL
The two major families of database query approaches
SQL — Structured Query Language
SQL is the standardised language for relational databases. It is divided into sub-languages based on their function:
| Category | Full Form | Commands | Purpose |
|---|---|---|---|
| DDL | Data Definition Language | CREATE, ALTER, DROP, TRUNCATE | Define and modify the structure (schema) of database objects like tables and indexes |
| DML | Data Manipulation Language | SELECT, INSERT, UPDATE, DELETE | Add, retrieve, modify, and remove data rows inside tables |
| DCL | Data Control Language | GRANT, REVOKE | Control access rights and permissions on database objects |
| TCL | Transaction Control Language | COMMIT, ROLLBACK, SAVEPOINT | Manage database transactions and maintain ACID properties |
Example SQL Queries
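A minimal runnable walk-through of the DDL, DML, and TCL categories from the table above, using Python's built-in sqlite3 module (SQLite has no GRANT/REVOKE, so DCL is omitted; the Student table and its data are invented for illustration):

```python
# DDL, DML, and TCL examples run against an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL - define and modify the structure
cur.execute("CREATE TABLE Student (StudentID INTEGER PRIMARY KEY, Name TEXT, GPA REAL)")
cur.execute("ALTER TABLE Student ADD COLUMN Department TEXT")

# DML - add, modify, and retrieve rows
cur.execute("INSERT INTO Student VALUES (101, 'Ram Sharma', 3.8, 'BCA')")
cur.execute("INSERT INTO Student VALUES (102, 'Sita Rai', 2.9, 'BIT')")
cur.execute("UPDATE Student SET GPA = 3.1 WHERE StudentID = 102")
high_gpa = cur.execute("SELECT Name FROM Student WHERE GPA > 3").fetchall()

# TCL - commit work, then undo an uncommitted mistake
conn.commit()                       # make the work above permanent
cur.execute("DELETE FROM Student")  # oops - deletes everything...
conn.rollback()                     # ...so roll the transaction back
count = cur.execute("SELECT COUNT(*) FROM Student").fetchone()[0]

print(high_gpa, count)  # both students have GPA > 3; the DELETE was undone
```

The ROLLBACK restores both rows because the DELETE was never committed — the same mechanism that lets a DBMS recover cleanly from a failed transaction.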
NoSQL — Not Only SQL
NoSQL databases are designed for unstructured or semi-structured data, horizontal scaling across many servers, and high-volume modern web and mobile applications. They trade strict ACID compliance for performance and schema flexibility.
Document Store
e.g. MongoDB, CouchDB
Stores data as self-describing JSON/BSON documents. Each document can have different fields. Excellent for product catalogs, user profiles, and content management systems.
Key-Value Store
e.g. Redis, Amazon DynamoDB
The simplest NoSQL type. Data stored as key-to-value pairs. Extremely fast reads and writes. Ideal for caching, user session management, and real-time leaderboards.
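The key-value model in miniature: a plain Python dict already captures the interface — constant-time get/set by key, no schema, no JOINs. Real systems like Redis add persistence, expiry, and networking on top; the session key and value here are invented:

```python
# An in-memory key-value store sketch used as a session cache.
store = {}

def set_value(key, value):
    store[key] = value

def get_value(key, default=None):
    return store.get(key, default)  # single hash lookup, O(1)

set_value("session:42", {"user": "ram", "cart_items": 3})
print(get_value("session:42")["cart_items"])  # 3
print(get_value("session:99"))                # None - missing keys just return a default
```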
Column-Family Store
e.g. Apache Cassandra, HBase
Organises data in column families rather than rows. Designed for write-heavy workloads at massive scale. Used for time-series data, IoT sensor streams, and analytics.
Graph Database
e.g. Neo4j, Amazon Neptune
Stores entities as nodes and relationships as edges. Ideal for social networks (friends of friends), recommendation engines, and fraud detection where relationships matter most.
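The "friends of friends" query that graph databases excel at is just a two-hop traversal over nodes and edges. A pure-Python sketch with an invented friendship graph:

```python
# Nodes are people, edges are friendships; a recommendation is anyone
# exactly two hops away (a friend's friend who isn't already a friend).
friends = {
    "Ram":  {"Sita", "Hari"},
    "Sita": {"Ram", "Gita"},
    "Hari": {"Ram", "Gita"},
    "Gita": {"Sita", "Hari"},
}

def friends_of_friends(person):
    """People two hops away: friends' friends, minus the person and direct friends."""
    direct = friends[person]
    two_hop = set()
    for f in direct:
        two_hop |= friends[f]
    return two_hop - direct - {person}

print(friends_of_friends("Ram"))  # {'Gita'} - a recommended connection
```

In a graph database such traversals follow stored edges directly, whereas a relational database would need a self-JOIN per hop.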
SQL vs NoSQL Comparison
| Feature | SQL (Relational) | NoSQL |
|---|---|---|
| Data Structure | Fixed schema — tables with predefined columns and data types | Flexible / schema-less — documents, key-value pairs, graphs, columns |
| Scalability | Vertical (scale up — upgrade to a more powerful single server) | Horizontal (scale out — add more commodity servers to a cluster) |
| Transactions | Full ACID compliance (Atomicity, Consistency, Isolation, Durability) | BASE model (Basically Available, Soft state, Eventually consistent) |
| Query Language | Standardised SQL — same syntax works across most vendors | Vendor-specific APIs and query languages (no universal standard) |
| Best For | Banking, ERP, e-commerce — complex relationships and strict consistency | Big data, real-time web apps, social media, IoT, content platforms |
| Examples | MySQL, PostgreSQL, Oracle, Microsoft SQL Server | MongoDB, Redis, Apache Cassandra, Neo4j |
Data Warehousing
Centralised repositories for analytical decision-making
Data Warehouse
A large, centralised repository that stores integrated, consolidated historical data collected from multiple source systems (OLTP databases, spreadsheets, external feeds), optimised for analytical queries and business reporting rather than day-to-day transaction processing. The term was coined by Bill Inmon, who is considered the father of data warehousing. Examples: Amazon Redshift, Google BigQuery, Snowflake.
Data Mart
A smaller, focused subset of a data warehouse that serves the analytical needs of a specific business department or function. A large company may have separate data marts for Sales, Finance, and HR, each drawing data from the main enterprise data warehouse.
The ETL Process
EXTRACT
Pull data from multiple heterogeneous source systems — OLTP databases, spreadsheets, web APIs, flat files, and external data feeds. Data may be in completely different formats and locations.
TRANSFORM
Clean, validate, and convert extracted data into a consistent, unified format. Resolve duplicates, fill missing values, standardise date formats, apply business rules, and map codes to descriptions.
LOAD
Load the fully transformed data into the data warehouse. Can be a full load (replace all data) or an incremental load (insert only records that are new or changed since the last ETL run).
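The three steps above can be sketched as a toy end-to-end ETL run. The source rows, field names, and date formats are all invented; the warehouse is an in-memory SQLite table:

```python
# EXTRACT -> TRANSFORM -> LOAD on made-up data from two "sources".
import sqlite3

# EXTRACT - rows pulled from heterogeneous sources in different formats
source_a = [{"id": 1, "name": "ram sharma", "sale_date": "05/01/2025", "amount": "120"},
            {"id": 2, "name": "sita rai",   "sale_date": "07/01/2025", "amount": "80"}]
source_b = [{"id": 2, "name": "Sita Rai",   "sale_date": "2025-01-07", "amount": "80"},  # duplicate
            {"id": 3, "name": "HARI KC",    "sale_date": "2025-01-09", "amount": "200"}]

# TRANSFORM - standardise dates and names, convert types, resolve duplicates by id
def to_iso(d):
    if "/" in d:                    # assume DD/MM/YYYY in source A
        day, month, year = d.split("/")
        return f"{year}-{month}-{day}"
    return d                        # source B is already ISO

clean = {}
for row in source_a + source_b:     # later sources win on duplicate ids
    clean[row["id"]] = (row["id"], row["name"].title(),
                        to_iso(row["sale_date"]), float(row["amount"]))

# LOAD - a full load into the warehouse table
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, name TEXT, sale_date TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?,?,?,?)", clean.values())

total = warehouse.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
print(total)  # 3 unique rows loaded; the duplicate id 2 was resolved
```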
OLTP vs OLAP
| Feature | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
|---|---|---|
| Purpose | Day-to-day operational business transactions | Complex analysis, reporting, and decision support |
| Data | Current, real-time operational data | Historical, aggregated data collected from OLTP systems over time |
| Operations | Frequent INSERT, UPDATE, DELETE (short transactions) | Mostly SELECT with complex GROUP BY, aggregations, and window functions |
| Database Size | Gigabytes to terabytes | Terabytes to petabytes |
| Concurrent Users | Thousands of end users (clerks, customers) simultaneously | Small number of analysts, managers, and data scientists |
| Response Time | Milliseconds — short, fast transactions required | Seconds to minutes — long, complex queries expected |
| Normalisation | Highly normalised (3NF) to eliminate redundancy and ensure integrity | Denormalised (star or snowflake schema) to maximise query read performance |
| Examples | Bank ATM withdrawals, e-commerce order placement, hospital billing | Business intelligence dashboards, sales trend forecasting, executive reports |
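The contrast in the table shows up directly in the queries each system runs. A sketch with invented sales data: OLTP issues many short writes as transactions happen, while OLAP issues one aggregating read for a report:

```python
# OLTP-style writes vs an OLAP-style aggregation on one SQLite table.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")

# OLTP style: many short INSERTs as individual transactions occur
rows = [("East", 2024, 100), ("East", 2025, 150),
        ("West", 2024, 200), ("West", 2025, 250)]
db.executemany("INSERT INTO sales VALUES (?,?,?)", rows)

# OLAP style: one complex read aggregating history for a dashboard
report = db.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()
print(report)  # [('East', 250.0), ('West', 450.0)]
```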
Data Mining & Big Data
Discovering patterns and managing massive datasets
Data Mining
The automated process of discovering hidden patterns, correlations, anomalies, and useful insights from large datasets using statistical, mathematical, and machine learning techniques. Also called Knowledge Discovery in Databases (KDD). Example: A bank mining millions of transactions to automatically detect fraudulent activity patterns.
Data Mining Techniques
Classification
Assigns data items to predefined categories (classes). A model is trained on labelled historical data, then used to classify new, unseen items. The most common supervised learning task in data mining.
e.g. Classifying emails as SPAM or NOT SPAM; diagnosing a patient as having diabetes or not based on test results.
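A deliberately tiny rule-based classifier in the spirit of the spam example. Real classifiers learn their rules from labelled training data; here the keyword list is hand-picked and invented, but the interface is the same — new, unseen items are assigned a predefined class:

```python
# Assign each email to one of two predefined classes: SPAM or NOT SPAM.
SPAM_KEYWORDS = {"winner", "free", "prize", "urgent"}

def classify(email_text):
    words = set(email_text.lower().split())
    return "SPAM" if words & SPAM_KEYWORDS else "NOT SPAM"

print(classify("You are a winner claim your free prize"))  # SPAM
print(classify("Meeting moved to 3pm tomorrow"))           # NOT SPAM
```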
Clustering
Groups similar data items together without any predefined labels (unsupervised learning). Items within a cluster are more similar to each other than to items in other clusters. No training labels required.
e.g. Segmenting customers by purchasing behaviour to target marketing; grouping news articles by topic automatically.
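A minimal one-dimensional k-means sketch of the clustering idea: customers are grouped by annual spend with no labels, only similarity. The spend figures, k=2, and the deterministic starting centroids are all invented for illustration:

```python
# 1-D k-means: repeatedly assign points to the nearest centroid,
# then move each centroid to its cluster's mean.
spend = [120, 130, 150, 900, 950, 1000]
centroids = [min(spend), max(spend)]  # simple deterministic start

for _ in range(10):                   # a few refinement passes suffice here
    clusters = [[], []]
    for x in spend:
        nearest = 0 if abs(x - centroids[0]) <= abs(x - centroids[1]) else 1
        clusters[nearest].append(x)
    centroids = [sum(c) / len(c) for c in clusters]

print(clusters)  # low spenders vs high spenders, found without any labels
```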
Regression
Predicts a continuous numerical value based on one or more input variables. Finds the mathematical relationship (linear or non-linear) between independent and dependent variables.
e.g. Predicting a house price from its size, location, and age; forecasting next quarter's sales revenue from historical trends.
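The house-price example can be worked in a few lines of pure Python: fit y = a + b·x by ordinary least squares, then predict for a new input. The size/price numbers are invented and made exactly linear so the fit is easy to verify:

```python
# Ordinary least squares for a single input variable: y = a + b*x.
sizes  = [50, 80, 110, 140]    # square metres (independent variable)
prices = [100, 160, 220, 280]  # in thousands (dependent variable)

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n
# slope = covariance(x, y) / variance(x)
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
     / sum((x - mean_x) ** 2 for x in sizes))
a = mean_y - b * mean_x        # intercept

predicted = a + b * 100        # predict the price of a 100 m^2 house
print(b, predicted)            # slope 2.0, predicted price 200.0
```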
Association Rule Mining
Discovers co-occurrence relationships between variables in large transaction datasets. Produces rules of the form "if customers buy X, they also tend to buy Y." Also called market basket analysis.
e.g. The classic finding: customers who buy diapers often also buy beer on Friday evenings. Used by supermarkets for product placement and promotions.
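Market basket analysis rests on two measures: support (how often an itemset appears across all transactions) and confidence (how often the rule holds when its "if" side applies). A sketch for the rule "diapers → beer" over a handful of invented baskets:

```python
# Support and confidence for an association rule over toy transactions.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"beer", "chips"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent): support of both sides over support of the 'if' side."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"diapers", "beer"}))       # 0.4  - both appear in 2 of 5 baskets
print(confidence({"diapers"}, {"beer"}))  # ~0.67 - beer in 2 of 3 diaper baskets
```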
Big Data — The 5 Vs
Big Data refers to datasets so large, fast-moving, or varied that traditional database software cannot capture, store, manage, or analyse them within acceptable time. Characterised by five properties:
Volume
The sheer quantity of data generated. Facebook stores over 100 petabytes of photos. Every day, 2.5 quintillion bytes of data are created globally.
Velocity
The speed at which data is generated, collected, and must be processed. Stock exchanges execute millions of trades per second requiring real-time analysis.
Variety
Data arrives in structured (database tables), semi-structured (JSON, XML logs), and unstructured (images, video, social media posts, audio) formats.
Veracity
The trustworthiness and quality of data. Big Data is often noisy, incomplete, inconsistent, or biased. Poor veracity leads directly to incorrect analysis and bad decisions.
Value
The ultimate goal — extracting actionable business value from raw data. Enormous data volumes are worthless without the ability to turn them into useful insights.
Big Data Processing Tools
Apache Hadoop
An open-source framework for distributed storage (HDFS — Hadoop Distributed File System) and distributed batch processing (MapReduce) of very large datasets across clusters of commodity hardware. Created by Doug Cutting and Mike Cafarella, developed at Yahoo from 2006, and inspired by Google's GFS and MapReduce papers.
Apache Spark
A fast, in-memory distributed data processing engine. Up to 100x faster than Hadoop MapReduce for iterative algorithms because it keeps intermediate data in RAM. Supports batch processing, real-time streaming (Spark Streaming), machine learning (MLlib), and SQL queries (Spark SQL) in one unified framework.
Analytical Questions
Long-answer model answers for exam preparation
Unit 4 Summary
Core topics and important exam questions
Core Topics Covered
Data, information, database, DBMS definitions
File system vs DBMS comparison
DBMS advantages and key concepts (PK, FK, tuple)
3-level architecture (external, conceptual, internal)
Database models: hierarchical, network, relational, OO
SQL sub-languages: DDL, DML, DCL, TCL
NoSQL types and SQL vs NoSQL comparison
ETL process, OLTP vs OLAP
Data mining techniques
Big Data — 5 Vs
Important Exam Questions
What is DBMS? Compare file system and DBMS. (10 marks)
Explain the relational model. Write SQL to select students with GPA > 3. (10 marks)
Explain the types of database models. (5 marks)
What is data mining? Explain three techniques. (10 marks)
Differentiate between OLTP and OLAP. (5 marks)
Explain the 5 Vs of Big Data. (5 marks)
Syllabus Coverage Checklist
Introduction to data, database, DBMS
Database system architecture
Database models (hierarchical, network, relational)
Database applications
SQL and NoSQL concepts
Introduction to data warehousing
Data mining
Concept of big data
How to Remember This Unit
Mnemonics and memory anchors for DBMS
Keys mnemonic: "PC Fans Can"
P=Primary key (unique row identifier) · C=Candidate key (eligible to be primary) · F=Foreign key (links to another table) · C=Composite key (multiple columns combined)
SQL categories: "DDL Defines, DML Manipulates, DCL Controls"
DDL → CREATE / ALTER / DROP (structure) · DML → SELECT / INSERT / UPDATE / DELETE (data) · DCL → GRANT / REVOKE (permissions)
NoSQL types: "Dogs Keep Cats Grounded"
D=Document (MongoDB) · K=Key-Value (Redis) · C=Column (Cassandra) · G=Graph (Neo4j)
ETL order: "Every Tiger Leaps"
E=Extract from source systems · T=Transform and clean the data · L=Load into the warehouse — always this exact order, never reversed
Big Data 5 Vs: "Very Varied Velocity Values Verified"
V=Volume · V=Variety · V=Velocity · V=Value · V=Veracity — five Vs, five fingers on one hand
OLTP vs OLAP memory hook
TP = Transaction Processing (daily Operations, fast writes) · AP = Analytical Processing (historical Analysis, slow complex reads)
Unit 4 Quiz — Database Management
1. Which key uniquely identifies each record in a database table?
2. Which of the following is a Data Definition Language (DDL) command in SQL?
3. The relational database model organises data into:
4. ETL in data warehousing stands for:
5. Which of the following is a NoSQL database?