Database Management System
DBMS, SQL, Data Warehousing & Data Mining
Introduction to Data & DBMS
Data, information, databases — and why DBMS exists
Data
Raw, unprocessed facts and figures without context. Examples: 85, 'Ram', 2025. On their own, these numbers and names carry no meaning.
Information
Data that has been processed and organised to give it meaning and context. Example: 'Ram scored 85 marks in the 2025 examination' is information derived from the raw data above.
Database
An organised, structured collection of related data stored together so it can be easily accessed, managed, updated, and retrieved. Example: a college's student records — names, roll numbers, marks, addresses — all stored and interlinked.
DBMS (Database Management System)
Software that creates, maintains, and controls access to a database. It acts as an interface between the database and the users or application programs that need the data. Examples: MySQL, Oracle, Microsoft Access, MongoDB, PostgreSQL.
Advantages of DBMS over Traditional File System
Eliminates Data Redundancy
File systems store the same data in multiple separate files, causing duplication. DBMS centralises data so each fact is stored once and shared.
Ensures Data Consistency
Because each fact is stored only once, a single update in a DBMS is reflected everywhere. File systems require manual synchronisation across files, which leads to contradictory values.
Data Sharing
Multiple users and applications can access the same database concurrently with controlled, role-based permissions.
Data Security
DBMS enforces access control — different users get different read/write permissions on specific tables or even individual fields.
Data Integrity
Constraints (primary key, NOT NULL, CHECK) ensure only valid data is stored. Traditional file systems have no built-in data validation.
Backup & Recovery
DBMS provides automatic backup mechanisms and transaction logs that allow full restoration of data after hardware failure or corruption.
Key DBMS Concepts
| Term | Definition | Example |
|---|---|---|
| Table (Relation) | A 2D structure with rows and columns that stores data about one entity type | Student table with columns: StudentID, Name, GPA, Department |
| Record (Row / Tuple) | A single, complete entry in a table — one instance of the entity | Row: (101, "Ram Sharma", 3.8, "BCA") |
| Field (Column / Attribute) | A single category of information stored for every record | The "GPA" column in the Student table |
| Primary Key | A field (or combination) that uniquely identifies each record; cannot be NULL or duplicate | StudentID — every student has a unique roll number |
| Foreign Key | A field in one table referencing the primary key of another table; creates a link between tables | CourseID in the Enrollment table references the Course table |
| Candidate Key | Any field that could serve as a primary key (must be unique and not null); one is chosen as the primary key | Both StudentID and Email qualify as candidate keys |
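The key and constraint concepts above can be demonstrated in a runnable sketch. This uses Python's built-in sqlite3 module; the Student and Enrollment tables mirror the examples in the table, but the exact column set is illustrative:

```python
# Primary key, foreign key, and CHECK constraints in action,
# using Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.execute("""
    CREATE TABLE Student (
        StudentID  INTEGER PRIMARY KEY,               -- unique, non-null identifier
        Name       TEXT NOT NULL,
        GPA        REAL CHECK (GPA BETWEEN 0 AND 4),  -- integrity constraint
        Department TEXT
    )""")
conn.execute("""
    CREATE TABLE Enrollment (
        StudentID INTEGER REFERENCES Student(StudentID),  -- foreign key link
        CourseID  TEXT
    )""")

conn.execute("INSERT INTO Student VALUES (101, 'Ram Sharma', 3.8, 'BCA')")

# The constraints reject invalid data:
errors = []
for stmt in [
    "INSERT INTO Student VALUES (101, 'Duplicate', 3.0, 'BCA')",  # duplicate PK
    "INSERT INTO Student VALUES (102, 'Bad GPA', 9.9, 'BCA')",    # CHECK fails
    "INSERT INTO Enrollment VALUES (999, 'CS101')",               # no such student
]:
    try:
        conn.execute(stmt)
    except sqlite3.IntegrityError as e:
        errors.append(type(e).__name__)

print(errors)  # three IntegrityErrors: every invalid row was rejected
```

All three invalid inserts fail with an integrity error, which is exactly the "only valid data is stored" guarantee that traditional file systems lack.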
Database Architecture
3-tier schema, data independence, DBA and user types
ANSI/SPARC 3-Level Architecture
The three-schema architecture separates the database into three levels so that changes at one level do not affect the others. This separation is the foundation of data independence.
External Level (View)
What each individual user or application sees. Different users can have different views of the same underlying data. Also called the View Level or User Schema.
A student sees only their own marks; an administrator sees all students across all courses.
Conceptual Level (Logical)
Describes the entire logical structure of the whole database — all tables, fields, relationships, constraints, and security rules. The DBA works primarily at this level.
The complete ER (Entity-Relationship) diagram of the college database.
Internal Level (Physical)
Describes how data is physically stored on disk — file organisation, indexing structures, data compression, and storage allocation. Entirely hidden from users.
A B-tree index on StudentID stored in binary format across disk blocks.
Data Independence
Logical Data Independence
The ability to change the conceptual schema (add a new table, add a column to an existing table) without needing to change the external views that individual users see. Changes to the logical design are shielded from end users.
Physical Data Independence
The ability to change the internal/physical schema (move data to a new storage device, change the indexing method from B-tree to hash) without affecting the conceptual or external schemas above it. Most DBMSs provide this.
Database Administrator (DBA) Role
The DBA has full control over the database system: designing and maintaining the schema, granting and revoking user permissions, overseeing backup and recovery, and tuning performance.
Types of Database Users
| User Type | Description | Example |
|---|---|---|
| End Users (Naive) | Use the database through pre-built application interfaces without any knowledge of SQL | A bank teller using the banking software to check account balances |
| Application Programmers | Write programs that interact with the database using embedded SQL or database APIs | A developer writing a student portal in PHP/Python that runs SQL queries |
| Sophisticated Users | Interact with the database directly using SQL query tools and reporting software | A data analyst writing complex multi-table JOIN queries for monthly reports |
| Database Administrator (DBA) | Full control — manages schema, user permissions, backup, recovery, and performance tuning | The IT DBA at a hospital managing the patient records database |
Database Models
How data is logically structured and related
Hierarchical Model
Organises data in a tree structure with a single root node at the top. Each child record has exactly one parent (one-to-many relationships only). Navigation is always top-down following parent-child paths. Fast for well-defined access patterns but rigid — adding new relationships requires restructuring the entire tree. Used in early IBM IMS systems and still present in the Windows Registry.
Network Model
An extension of the hierarchical model where a child record can have multiple parents, forming a graph (web-like) structure with many-to-many relationships. More flexible than hierarchical but significantly more complex to design, navigate, and maintain. Used in CODASYL-standard databases of the 1970s.
Relational Model
Organises all data into two-dimensional tables called relations. Relationships between tables are expressed through shared key values (foreign keys), not physical pointers. Based on the relational algebra proposed by E.F. Codd at IBM in 1970. This is the dominant database model today. SQL is its standard query language. Examples: MySQL, Oracle, PostgreSQL.
Object-Oriented Model
Stores data as objects — directly mirroring object-oriented programming concepts. Each object encapsulates both attributes (data) and methods (behaviour). Supports inheritance, encapsulation, and polymorphism. Well-suited for complex data types like CAD drawings, multimedia files, and scientific simulations. Examples: db4o, ObjectDB.
Comparison of Database Models
| Model | Structure | Relationships | Example System | Main Advantage | Main Disadvantage |
|---|---|---|---|---|---|
| Hierarchical | Tree | One-to-many (single parent only) | IBM IMS, Windows Registry | Very fast for predefined top-down queries | Rigid; no many-to-many; restructuring is costly |
| Network | Graph | Many-to-many supported | IDMS, CODASYL databases | Flexible — models complex real-world relationships | Complex design, navigation, and maintenance |
| Relational | Tables (Relations) | Any via foreign keys and JOINs | MySQL, Oracle, PostgreSQL | Simple, flexible, powerful; standard SQL language | Performance can degrade with very large multi-table JOINs |
| Object-Oriented | Objects with attributes and methods | Inheritance and associations | db4o, ObjectDB | Natural for complex, multimedia, and CAD data | Less standardised; fewer tools and developer familiarity |
SQL & NoSQL
The two major families of database query approaches
SQL — Structured Query Language
SQL is the standardised language for relational databases. It is divided into sub-languages based on their function:
| Category | Full Form | Commands | Purpose |
|---|---|---|---|
| DDL | Data Definition Language | CREATE, ALTER, DROP, TRUNCATE | Define and modify the structure (schema) of database objects like tables and indexes |
| DML | Data Manipulation Language | SELECT, INSERT, UPDATE, DELETE | Add, retrieve, modify, and remove data rows inside tables |
| DCL | Data Control Language | GRANT, REVOKE | Control access rights and permissions on database objects |
| TCL | Transaction Control Language | COMMIT, ROLLBACK, SAVEPOINT | Manage database transactions and maintain ACID properties |
Example SQL Queries
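A minimal runnable walk-through of the DDL, DML, and TCL categories from the table above, using Python's built-in sqlite3 module (SQLite has no GRANT/REVOKE, so DCL is omitted; the Student table and its data are invented for illustration):

```python
# DDL, DML, and TCL examples run against an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL - define and modify the structure
cur.execute("CREATE TABLE Student (StudentID INTEGER PRIMARY KEY, Name TEXT, GPA REAL)")
cur.execute("ALTER TABLE Student ADD COLUMN Department TEXT")

# DML - add, modify, and retrieve rows
cur.execute("INSERT INTO Student VALUES (101, 'Ram Sharma', 3.8, 'BCA')")
cur.execute("INSERT INTO Student VALUES (102, 'Sita Rai', 2.9, 'BIT')")
cur.execute("UPDATE Student SET GPA = 3.1 WHERE StudentID = 102")
high_gpa = cur.execute("SELECT Name FROM Student WHERE GPA > 3").fetchall()

# TCL - commit work, then undo an uncommitted mistake
conn.commit()                       # make the work above permanent
cur.execute("DELETE FROM Student")  # oops - deletes everything...
conn.rollback()                     # ...so roll the transaction back
count = cur.execute("SELECT COUNT(*) FROM Student").fetchone()[0]

print(high_gpa, count)  # both students have GPA > 3; the DELETE was undone
```

The ROLLBACK restores both rows because the DELETE was never committed — the same mechanism that lets a DBMS recover cleanly from a failed transaction.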
NoSQL — Not Only SQL
NoSQL databases are designed for unstructured or semi-structured data, horizontal scaling across many servers, and high-volume modern web and mobile applications. They trade strict ACID compliance for performance and schema flexibility.
Document Store
e.g. MongoDB, CouchDB
Stores data as self-describing JSON/BSON documents. Each document can have different fields. Excellent for product catalogs, user profiles, and content management systems.
Key-Value Store
e.g. Redis, Amazon DynamoDB
The simplest NoSQL type. Data stored as key-to-value pairs. Extremely fast reads and writes. Ideal for caching, user session management, and real-time leaderboards.
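The key-value model in miniature: a plain Python dict already captures the interface — constant-time get/set by key, no schema, no JOINs. Real systems like Redis add persistence, expiry, and networking on top; the session key and value here are invented:

```python
# An in-memory key-value store sketch used as a session cache.
store = {}

def set_value(key, value):
    store[key] = value

def get_value(key, default=None):
    return store.get(key, default)  # single hash lookup, O(1)

set_value("session:42", {"user": "ram", "cart_items": 3})
print(get_value("session:42")["cart_items"])  # 3
print(get_value("session:99"))                # None - missing keys just return a default
```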
Column-Family Store
e.g. Apache Cassandra, HBase
Organises data in column families rather than rows. Designed for write-heavy workloads at massive scale. Used for time-series data, IoT sensor streams, and analytics.
Graph Database
e.g. Neo4j, Amazon Neptune
Stores entities as nodes and relationships as edges. Ideal for social networks (friends of friends), recommendation engines, and fraud detection where relationships matter most.
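The "friends of friends" query that graph databases excel at is just a two-hop traversal over nodes and edges. A pure-Python sketch with an invented friendship graph:

```python
# Nodes are people, edges are friendships; a recommendation is anyone
# exactly two hops away (a friend's friend who isn't already a friend).
friends = {
    "Ram":  {"Sita", "Hari"},
    "Sita": {"Ram", "Gita"},
    "Hari": {"Ram", "Gita"},
    "Gita": {"Sita", "Hari"},
}

def friends_of_friends(person):
    """People two hops away: friends' friends, minus the person and direct friends."""
    direct = friends[person]
    two_hop = set()
    for f in direct:
        two_hop |= friends[f]
    return two_hop - direct - {person}

print(friends_of_friends("Ram"))  # {'Gita'} - a recommended connection
```

In a graph database such traversals follow stored edges directly, whereas a relational database would need a self-JOIN per hop.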
SQL vs NoSQL Comparison
| Feature | SQL (Relational) | NoSQL |
|---|---|---|
| Data Structure | Fixed schema — tables with predefined columns and data types | Flexible / schema-less — documents, key-value pairs, graphs, columns |
| Scalability | Vertical (scale up — upgrade to a more powerful single server) | Horizontal (scale out — add more commodity servers to a cluster) |
| Transactions | Full ACID compliance (Atomicity, Consistency, Isolation, Durability) | BASE model (Basically Available, Soft state, Eventually consistent) |
| Query Language | Standardised SQL — same syntax works across most vendors | Vendor-specific APIs and query languages (no universal standard) |
| Best For | Banking, ERP, e-commerce — complex relationships and strict consistency | Big data, real-time web apps, social media, IoT, content platforms |
| Examples | MySQL, PostgreSQL, Oracle, Microsoft SQL Server | MongoDB, Redis, Apache Cassandra, Neo4j |
Data Warehousing
Centralised repositories for analytical decision-making
Data Warehouse
A large, centralised repository that stores integrated, consolidated historical data collected from multiple source systems (OLTP databases, spreadsheets, external feeds), optimised for analytical queries and business reporting rather than day-to-day transaction processing. The term was coined by Bill Inmon, who is considered the father of data warehousing. Examples: Amazon Redshift, Google BigQuery, Snowflake.
Data Mart
A smaller, focused subset of a data warehouse that serves the analytical needs of a specific business department or function. A large company may have separate data marts for Sales, Finance, and HR, each drawing data from the main enterprise data warehouse.
The ETL Process
EXTRACT
Pull data from multiple heterogeneous source systems — OLTP databases, spreadsheets, web APIs, flat files, and external data feeds. Data may be in completely different formats and locations.
TRANSFORM
Clean, validate, and convert extracted data into a consistent, unified format. Resolve duplicates, fill missing values, standardise date formats, apply business rules, and map codes to descriptions.
LOAD
Load the fully transformed data into the data warehouse. Can be a full load (replace all data) or an incremental load (insert only records that are new or changed since the last ETL run).
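The three steps above can be sketched as a toy end-to-end ETL run. The source rows, field names, and date formats are all invented; the warehouse is an in-memory SQLite table:

```python
# EXTRACT -> TRANSFORM -> LOAD on made-up data from two "sources".
import sqlite3

# EXTRACT - rows pulled from heterogeneous sources in different formats
source_a = [{"id": 1, "name": "ram sharma", "sale_date": "05/01/2025", "amount": "120"},
            {"id": 2, "name": "sita rai",   "sale_date": "07/01/2025", "amount": "80"}]
source_b = [{"id": 2, "name": "Sita Rai",   "sale_date": "2025-01-07", "amount": "80"},  # duplicate
            {"id": 3, "name": "HARI KC",    "sale_date": "2025-01-09", "amount": "200"}]

# TRANSFORM - standardise dates and names, convert types, resolve duplicates by id
def to_iso(d):
    if "/" in d:                    # assume DD/MM/YYYY in source A
        day, month, year = d.split("/")
        return f"{year}-{month}-{day}"
    return d                        # source B is already ISO

clean = {}
for row in source_a + source_b:     # later sources win on duplicate ids
    clean[row["id"]] = (row["id"], row["name"].title(),
                        to_iso(row["sale_date"]), float(row["amount"]))

# LOAD - a full load into the warehouse table
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, name TEXT, sale_date TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?,?,?,?)", clean.values())

total = warehouse.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
print(total)  # 3 unique rows loaded; the duplicate id 2 was resolved
```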
OLTP vs OLAP
| Feature | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
|---|---|---|
| Purpose | Day-to-day operational business transactions | Complex analysis, reporting, and decision support |
| Data | Current, real-time operational data | Historical, aggregated data collected from OLTP systems over time |
| Operations | Frequent INSERT, UPDATE, DELETE (short transactions) | Mostly SELECT with complex GROUP BY, aggregations, and window functions |
| Database Size | Gigabytes to terabytes | Terabytes to petabytes |
| Concurrent Users | Thousands of end users (clerks, customers) simultaneously | Small number of analysts, managers, and data scientists |
| Response Time | Milliseconds — short, fast transactions required | Seconds to minutes — long, complex queries expected |
| Normalisation | Highly normalised (3NF) to eliminate redundancy and ensure integrity | Denormalised (star or snowflake schema) to maximise query read performance |
| Examples | Bank ATM withdrawals, e-commerce order placement, hospital billing | Business intelligence dashboards, sales trend forecasting, executive reports |
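The contrast in the table shows up directly in the queries each system runs. A sketch with invented sales data: OLTP issues many short writes as transactions happen, while OLAP issues one aggregating read for a report:

```python
# OLTP-style writes vs an OLAP-style aggregation on one SQLite table.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")

# OLTP style: many short INSERTs as individual transactions occur
rows = [("East", 2024, 100), ("East", 2025, 150),
        ("West", 2024, 200), ("West", 2025, 250)]
db.executemany("INSERT INTO sales VALUES (?,?,?)", rows)

# OLAP style: one complex read aggregating history for a dashboard
report = db.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()
print(report)  # [('East', 250.0), ('West', 450.0)]
```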
Data Mining & Big Data
Discovering patterns and managing massive datasets
Data Mining
The automated process of discovering hidden patterns, correlations, anomalies, and useful insights from large datasets using statistical, mathematical, and machine learning techniques. Also called Knowledge Discovery in Databases (KDD). Example: A bank mining millions of transactions to automatically detect fraudulent activity patterns.
Data Mining Techniques
Classification
Assigns data items to predefined categories (classes). A model is trained on labelled historical data, then used to classify new, unseen items. The most common supervised learning task in data mining.
e.g. Classifying emails as SPAM or NOT SPAM; diagnosing a patient as having diabetes or not based on test results.
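A deliberately tiny rule-based classifier in the spirit of the spam example. Real classifiers learn their rules from labelled training data; here the keyword list is hand-picked and invented, but the interface is the same — new, unseen items are assigned a predefined class:

```python
# Assign each email to one of two predefined classes: SPAM or NOT SPAM.
SPAM_KEYWORDS = {"winner", "free", "prize", "urgent"}

def classify(email_text):
    words = set(email_text.lower().split())
    return "SPAM" if words & SPAM_KEYWORDS else "NOT SPAM"

print(classify("You are a winner claim your free prize"))  # SPAM
print(classify("Meeting moved to 3pm tomorrow"))           # NOT SPAM
```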
Clustering
Groups similar data items together without any predefined labels (unsupervised learning). Items within a cluster are more similar to each other than to items in other clusters. No training labels required.
e.g. Segmenting customers by purchasing behaviour to target marketing; grouping news articles by topic automatically.
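A minimal one-dimensional k-means sketch of the clustering idea: customers are grouped by annual spend with no labels, only similarity. The spend figures, k=2, and the deterministic starting centroids are all invented for illustration:

```python
# 1-D k-means: repeatedly assign points to the nearest centroid,
# then move each centroid to its cluster's mean.
spend = [120, 130, 150, 900, 950, 1000]
centroids = [min(spend), max(spend)]  # simple deterministic start

for _ in range(10):                   # a few refinement passes suffice here
    clusters = [[], []]
    for x in spend:
        nearest = 0 if abs(x - centroids[0]) <= abs(x - centroids[1]) else 1
        clusters[nearest].append(x)
    centroids = [sum(c) / len(c) for c in clusters]

print(clusters)  # low spenders vs high spenders, found without any labels
```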
Regression
Predicts a continuous numerical value based on one or more input variables. Finds the mathematical relationship (linear or non-linear) between independent and dependent variables.
e.g. Predicting a house price from its size, location, and age; forecasting next quarter's sales revenue from historical trends.
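The house-price example can be worked in a few lines of pure Python: fit y = a + b·x by ordinary least squares, then predict for a new input. The size/price numbers are invented and made exactly linear so the fit is easy to verify:

```python
# Ordinary least squares for a single input variable: y = a + b*x.
sizes  = [50, 80, 110, 140]    # square metres (independent variable)
prices = [100, 160, 220, 280]  # in thousands (dependent variable)

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n
# slope = covariance(x, y) / variance(x)
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
     / sum((x - mean_x) ** 2 for x in sizes))
a = mean_y - b * mean_x        # intercept

predicted = a + b * 100        # predict the price of a 100 m^2 house
print(b, predicted)            # slope 2.0, predicted price 200.0
```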
Association Rule Mining
Discovers co-occurrence relationships between variables in large transaction datasets. Produces rules of the form "if customers buy X, they also tend to buy Y." Also called market basket analysis.
e.g. The classic finding: customers who buy diapers often also buy beer on Friday evenings. Used by supermarkets for product placement and promotions.
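Market basket analysis rests on two measures: support (how often an itemset appears across all transactions) and confidence (how often the rule holds when its "if" side applies). A sketch for the rule "diapers → beer" over a handful of invented baskets:

```python
# Support and confidence for an association rule over toy transactions.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"beer", "chips"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent): support of both sides over support of the 'if' side."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"diapers", "beer"}))       # 0.4  - both appear in 2 of 5 baskets
print(confidence({"diapers"}, {"beer"}))  # ~0.67 - beer in 2 of 3 diaper baskets
```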
Big Data — The 5 Vs
Big Data refers to datasets so large, fast-moving, or varied that traditional database software cannot capture, store, manage, or analyse them within acceptable time. Characterised by five properties:
Volume
The sheer quantity of data generated. Facebook stores over 100 petabytes of photos. Every day, 2.5 quintillion bytes of data are created globally.
Velocity
The speed at which data is generated, collected, and must be processed. Stock exchanges execute millions of trades per second requiring real-time analysis.
Variety
Data arrives in structured (database tables), semi-structured (JSON, XML logs), and unstructured (images, video, social media posts, audio) formats.
Veracity
The trustworthiness and quality of data. Big Data is often noisy, incomplete, inconsistent, or biased. Poor veracity leads directly to incorrect analysis and bad decisions.
Value
The ultimate goal — extracting actionable business value from raw data. Enormous data volumes are worthless without the ability to turn them into useful insights.
Big Data Processing Tools
Apache Hadoop
An open-source framework for distributed storage (HDFS — Hadoop Distributed File System) and distributed batch processing (MapReduce) of very large datasets across clusters of commodity hardware. Created by Doug Cutting and Mike Cafarella, developed at Yahoo from 2006, and inspired by Google's GFS and MapReduce papers.
Apache Spark
A fast, in-memory distributed data processing engine. Up to 100x faster than Hadoop MapReduce for iterative algorithms because it keeps intermediate data in RAM. Supports batch processing, real-time streaming (Spark Streaming), machine learning (MLlib), and SQL queries (Spark SQL) in one unified framework.
Analytical Questions
Long-answer model answers for exam preparation
Unit 4 Summary
Core topics and important exam questions
Core Topics Covered
Data, information, database, DBMS definitions
File system vs DBMS comparison
DBMS advantages and key concepts (PK, FK, tuple)
3-level architecture (external, conceptual, internal)
Database models: hierarchical, network, relational, OO
SQL sub-languages: DDL, DML, DCL, TCL
NoSQL types and SQL vs NoSQL comparison
ETL process, OLTP vs OLAP
Data mining techniques
Big Data — 5 Vs
Important Exam Questions
What is DBMS? Compare file system and DBMS. (10 marks)
Explain the relational model. Write SQL to select students with GPA > 3. (10 marks)
Explain the types of database models. (5 marks)
What is data mining? Explain three techniques. (10 marks)
Differentiate between OLTP and OLAP. (5 marks)
Explain the 5 Vs of Big Data. (5 marks)
Syllabus Coverage Checklist
Introduction to data, database, DBMS
Database system architecture
Database models (hierarchical, network, relational)
Database applications
SQL and NoSQL concepts
Introduction to data warehousing
Data mining
Concept of big data
How to Remember This Unit
Mnemonics and memory anchors for DBMS
Keys mnemonic: "PC Fans Can"
P=Primary key (unique row identifier) · C=Candidate key (eligible to be primary) · F=Foreign key (links to another table) · C=Composite key (multiple columns combined)
SQL categories: "DDL Defines, DML Manipulates, DCL Controls"
DDL → CREATE / ALTER / DROP (structure) · DML → SELECT / INSERT / UPDATE / DELETE (data) · DCL → GRANT / REVOKE (permissions)
NoSQL types: "Dogs Keep Cats Grounded"
D=Document (MongoDB) · K=Key-Value (Redis) · C=Column (Cassandra) · G=Graph (Neo4j)
ETL order: "Every Tiger Leaps"
E=Extract from source systems · T=Transform and clean the data · L=Load into the warehouse — always this exact order, never reversed
Big Data 5 Vs: "Very Varied Velocity Values Verified"
V=Volume · V=Variety · V=Velocity · V=Value · V=Veracity — five Vs, five fingers on one hand
OLTP vs OLAP memory hook
TP = Transaction Processing (daily Operations, fast writes) · AP = Analytical Processing (historical Analysis, slow complex reads)
Unit 4 Quiz — Database Management
1. Which key uniquely identifies each record in a database table?
2. Which of the following is a Data Definition Language (DDL) command in SQL?
3. The relational database model organises data into:
4. ETL in data warehousing stands for:
5. Which of the following is a NoSQL database?