Unit 4 · 8 hours · CACS 101

Database Management System

DBMS, SQL, Data Warehousing & Data Mining

🗄

Introduction to Data & DBMS

Data, information, databases — and why DBMS exists

Data

Raw, unprocessed facts and figures without context. Examples: 85, 'Ram', 2025. On their own, these numbers and names carry no meaning.

Information

Data that has been processed and organised to give it meaning and context. Example: 'Ram scored 85 marks in the 2025 examination' is information derived from the raw data above.

Database

An organised, structured collection of related data stored together so it can be easily accessed, managed, updated, and retrieved. Example: a college's student records — names, roll numbers, marks, addresses — all stored and interlinked.

DBMS (Database Management System)

Software that creates, maintains, and controls access to a database. It acts as an interface between the database and the users or application programs that need the data. Examples: MySQL, Oracle, Microsoft Access, MongoDB, PostgreSQL.

Advantages of DBMS over Traditional File System

Eliminates Data Redundancy

File systems store the same data in multiple separate files, causing duplication. DBMS centralises data so each fact is stored once and shared.

Ensures Data Consistency

Because each fact is stored once, a single update in a DBMS is reflected everywhere the data is used. File systems require manual synchronisation across files, leading to contradictory values.

Data Sharing

Multiple users and applications can access the same database concurrently with controlled, role-based permissions.

Data Security

DBMS enforces access control — different users get different read/write permissions on specific tables or even individual fields.

Data Integrity

Constraints (primary key, NOT NULL, CHECK) ensure only valid data is stored. Traditional file systems have no built-in data validation.
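These constraints can be seen rejecting bad data in a minimal sketch using Python's built-in sqlite3 module. The Student table, column names, and values are illustrative assumptions, not from a real system:

```python
import sqlite3

# Hypothetical Student table with integrity constraints.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Student (
        StudentID INTEGER PRIMARY KEY,
        Name      TEXT NOT NULL,
        GPA       REAL CHECK (GPA BETWEEN 0.0 AND 4.0)
    )
""")

# A valid row is accepted.
conn.execute("INSERT INTO Student VALUES (101, 'Ram Sharma', 3.8)")

# Rows violating NOT NULL or CHECK are rejected by the DBMS itself,
# before any application code runs.
rejected = []
for row in [(102, None, 3.2),     # NULL name  -> NOT NULL violation
            (103, 'Sita', 9.9)]:  # GPA > 4.0  -> CHECK violation
    try:
        conn.execute("INSERT INTO Student VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append(row[0])

row_count = conn.execute("SELECT COUNT(*) FROM Student").fetchone()[0]
```

Only the valid row survives; the two invalid inserts raise `IntegrityError` instead of silently storing bad data.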

Backup & Recovery

DBMS provides automatic backup mechanisms and transaction logs that allow full restoration of data after hardware failure or corruption.

Key DBMS Concepts

| Term | Definition | Example |
|---|---|---|
| Table (Relation) | A 2D structure with rows and columns that stores data about one entity type | Student table with columns: StudentID, Name, GPA, Department |
| Record (Row / Tuple) | A single, complete entry in a table — one instance of the entity | Row: (101, "Ram Sharma", 3.8, "BCA") |
| Field (Column / Attribute) | A single category of information stored for every record | The "GPA" column in the Student table |
| Primary Key | A field (or combination) that uniquely identifies each record; cannot be NULL or duplicate | StudentID — every student has a unique roll number |
| Foreign Key | A field in one table referencing the primary key of another table; creates a link between tables | CourseID in the Enrollment table references the Course table |
| Candidate Key | Any field that could serve as a primary key (must be unique and not null); one is chosen as the primary key | Both StudentID and Email qualify as candidate keys |
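The primary-key/foreign-key link can be sketched with Python's sqlite3 module. The Course and Enrollment tables and their values are illustrative; note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires this for FK checks

conn.execute("CREATE TABLE Course (CourseID INTEGER PRIMARY KEY, Title TEXT)")
conn.execute("""
    CREATE TABLE Enrollment (
        StudentID INTEGER,
        CourseID  INTEGER REFERENCES Course(CourseID)
    )
""")
conn.execute("INSERT INTO Course VALUES (1, 'DBMS')")

# Enrolling in an existing course works ...
conn.execute("INSERT INTO Enrollment VALUES (101, 1)")

# ... but a CourseID with no matching Course row is rejected:
# the foreign key guarantees referential integrity.
fk_rejected = False
try:
    conn.execute("INSERT INTO Enrollment VALUES (101, 99)")
except sqlite3.IntegrityError:
    fk_rejected = True

enrolled = conn.execute("SELECT COUNT(*) FROM Enrollment").fetchone()[0]
```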
🏛

Database Architecture

3-tier schema, data independence, DBA and user types

ANSI/SPARC 3-Level Architecture

The three-schema architecture separates the database into three levels so that changes at one level do not affect the others. This separation is the foundation of data independence.

External Level (View)

What each individual user or application sees. Different users can have different views of the same underlying data. Also called the View Level or User Schema.

A student sees only their own marks; an administrator sees all students across all courses.

Conceptual Level (Logical)

Describes the entire logical structure of the whole database — all tables, fields, relationships, constraints, and security rules. The DBA works primarily at this level.

The complete ER (Entity-Relationship) diagram of the college database.

Internal Level (Physical)

Describes how data is physically stored on disk — file organisation, indexing structures, data compression, and storage allocation. Entirely hidden from users.

A B-tree index on StudentID stored in binary format across disk blocks.

Data Independence

Logical Data Independence

The ability to change the conceptual schema (add a new table, add a column to an existing table) without needing to change the external views that individual users see. Changes to the logical design are shielded from end users.

Physical Data Independence

The ability to change the internal/physical schema (move data to a new storage device, change the indexing method from B-tree to hash) without affecting the conceptual or external schemas above it. Most DBMS systems provide this.

Database Administrator (DBA) Role

Defining the database schema at the conceptual level
Granting and revoking user access permissions and roles
Monitoring database performance and optimising slow queries
Scheduling and verifying regular backups and recovery drills
Ensuring data integrity by defining and enforcing constraints
Installing, patching, and upgrading the DBMS software

Types of Database Users

| User Type | Description | Example |
|---|---|---|
| End Users (Naive) | Use the database through pre-built application interfaces without any knowledge of SQL | A bank teller using the banking software to check account balances |
| Application Programmers | Write programs that interact with the database using embedded SQL or database APIs | A developer writing a student portal in PHP/Python that runs SQL queries |
| Sophisticated Users | Interact with the database directly using SQL query tools and reporting software | A data analyst writing complex multi-table JOIN queries for monthly reports |
| Database Administrator (DBA) | Full control — manages schema, user permissions, backup, recovery, and performance tuning | The IT DBA at a hospital managing the patient records database |
📐

Database Models

How data is logically structured and related

Hierarchical Model

Organises data in a tree structure with a single root node at the top. Each child record has exactly one parent (one-to-many relationships only). Navigation is always top-down following parent-child paths. Fast for well-defined access patterns but rigid — adding new relationships requires restructuring the entire tree. Used in early IBM IMS systems and still present in the Windows Registry.

Network Model

An extension of the hierarchical model where a child record can have multiple parents, forming a graph (web-like) structure with many-to-many relationships. More flexible than hierarchical but significantly more complex to design, navigate, and maintain. Used in CODASYL-standard databases of the 1970s.

Relational Model

Organises all data into two-dimensional tables called relations. Relationships between tables are expressed through shared key values (foreign keys), not physical pointers. Based on E.F. Codd's mathematical relational algebra (IBM, 1970). This is the dominant database model today. SQL is its standard query language. Examples: MySQL, Oracle, PostgreSQL.

Object-Oriented Model

Stores data as objects — directly mirroring object-oriented programming concepts. Each object encapsulates both attributes (data) and methods (behaviour). Supports inheritance, encapsulation, and polymorphism. Well-suited for complex data types like CAD drawings, multimedia files, and scientific simulations. Examples: db4o, ObjectDB.

Comparison of Database Models

| Model | Structure | Relationships | Example System | Main Advantage | Main Disadvantage |
|---|---|---|---|---|---|
| Hierarchical | Tree | One-to-many (single parent only) | IBM IMS, Windows Registry | Very fast for predefined top-down queries | Rigid; no many-to-many; restructuring is costly |
| Network | Graph | Many-to-many supported | IDMS, CODASYL databases | Flexible — models complex real-world relationships | Complex design, navigation, and maintenance |
| Relational | Tables (Relations) | Any, via foreign keys and JOINs | MySQL, Oracle, PostgreSQL | Simple, flexible, powerful; standard SQL language | Performance can degrade with very large multi-table JOINs |
| Object-Oriented | Objects with attributes and methods | Inheritance and associations | db4o, ObjectDB | Natural for complex, multimedia, and CAD data | Less standardised; fewer tools and developer familiarity |
🔍

SQL & NoSQL

The two major families of database query approaches

SQL — Structured Query Language

SQL is the standardised language for relational databases. It is divided into sub-languages based on their function:

| Category | Full Form | Commands | Purpose |
|---|---|---|---|
| DDL | Data Definition Language | CREATE, ALTER, DROP, TRUNCATE | Define and modify the structure (schema) of database objects like tables and indexes |
| DML | Data Manipulation Language | SELECT, INSERT, UPDATE, DELETE | Add, retrieve, modify, and remove data rows inside tables |
| DCL | Data Control Language | GRANT, REVOKE | Control access rights and permissions on database objects |
| TCL | Transaction Control Language | COMMIT, ROLLBACK, SAVEPOINT | Manage database transactions and maintain ACID properties |

Example SQL Queries

```sql
-- DDL: Create a Student table
CREATE TABLE Student (
    StudentID  INT PRIMARY KEY,
    Name       VARCHAR(50) NOT NULL,
    GPA        DECIMAL(3,2),
    Department VARCHAR(30)
);

-- DML: Insert a record
INSERT INTO Student VALUES (101, 'Ram Sharma', 3.8, 'BCA');

-- DML: Select all students with GPA above 3.0, sorted highest first
SELECT Name, GPA
FROM Student
WHERE GPA > 3.0
ORDER BY GPA DESC;
```

NoSQL — Not Only SQL

NoSQL databases are designed for unstructured or semi-structured data, horizontal scaling across many servers, and high-volume modern web and mobile applications. They trade strict ACID compliance for performance and schema flexibility.

Document Store

eg: MongoDB, CouchDB

Stores data as self-describing JSON/BSON documents. Each document can have different fields. Excellent for product catalogs, user profiles, and content management systems.
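The schema flexibility described above can be sketched with plain Python dictionaries standing in for documents — no real MongoDB involved, and the product data is invented:

```python
# Two "documents" in the same collection with different fields.
# A relational table would force every row into one fixed schema;
# here each document carries only the fields it needs.
products = [
    {"_id": 1, "name": "Laptop",  "price": 1200, "specs": {"ram_gb": 16}},
    {"_id": 2, "name": "T-shirt", "price": 15,   "sizes": ["S", "M", "L"]},
]

# A simple query: products cheaper than 100
# (the MongoDB equivalent filter would be {"price": {"$lt": 100}}).
cheap = [doc["name"] for doc in products if doc["price"] < 100]
```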

Key-Value Store

eg: Redis, Amazon DynamoDB

The simplest NoSQL type. Data stored as key-to-value pairs. Extremely fast reads and writes. Ideal for caching, user session management, and real-time leaderboards.
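The entire interface of a key-value store reduces to two operations, sketched here with a plain Python dict standing in for a server like Redis (the keys and values are illustrative):

```python
# A key-value store maps an opaque key to a value -- no schema, no joins.
session_store = {}

def set_key(key, value):
    """Store a value under a key (Redis: SET)."""
    session_store[key] = value

def get_key(key, default=None):
    """Fetch a value by key, with a fallback (Redis: GET)."""
    return session_store.get(key, default)

# Typical use: caching a logged-in user's session.
set_key("session:101", {"user": "ram", "logged_in": True})
hit = get_key("session:101")["user"]
miss = get_key("session:999", "anonymous")  # unknown key -> default
```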

Column-Family Store

eg: Apache Cassandra, HBase

Organises data in column families rather than rows. Designed for write-heavy workloads at massive scale. Used for time-series data, IoT sensor streams, and analytics.

Graph Database

eg: Neo4j, Amazon Neptune

Stores entities as nodes and relationships as edges. Ideal for social networks (friends of friends), recommendation engines, and fraud detection where relationships matter most.
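The "friends of friends" query can be sketched in plain Python with an adjacency-set graph; the people are invented, and a real graph database would express this as a declarative traversal (in Cypher, roughly a two-hop `MATCH` pattern):

```python
# A tiny social graph: nodes are people, edges are friendships.
friends = {
    "Ram":  {"Sita", "Hari"},
    "Sita": {"Ram", "Gita"},
    "Hari": {"Ram"},
    "Gita": {"Sita"},
}

# Friends of friends of Ram: one extra hop outward,
# excluding Ram himself and his direct friends.
direct = friends["Ram"]
fof = set()
for friend in direct:
    fof |= friends[friend]
fof -= direct | {"Ram"}
```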

SQL vs NoSQL Comparison

| Feature | SQL (Relational) | NoSQL |
|---|---|---|
| Data Structure | Fixed schema — tables with predefined columns and data types | Flexible / schema-less — documents, key-value pairs, graphs, columns |
| Scalability | Vertical (scale up — upgrade to a more powerful single server) | Horizontal (scale out — add more commodity servers to a cluster) |
| Transactions | Full ACID compliance (Atomicity, Consistency, Isolation, Durability) | BASE model (Basically Available, Soft state, Eventually consistent) |
| Query Language | Standardised SQL — same syntax works across most vendors | Vendor-specific APIs and query languages (no universal standard) |
| Best For | Banking, ERP, e-commerce — complex relationships and strict consistency | Big data, real-time web apps, social media, IoT, content platforms |
| Examples | MySQL, PostgreSQL, Oracle, Microsoft SQL Server | MongoDB, Redis, Apache Cassandra, Neo4j |
🏭

Data Warehousing

Centralised repositories for analytical decision-making

Data Warehouse

A large, centralised repository that stores integrated, consolidated historical data collected from multiple source systems (OLTP databases, spreadsheets, external feeds), optimised for analytical queries and business reporting rather than day-to-day transaction processing. The term was coined by Bill Inmon, who is considered the father of data warehousing. Examples: Amazon Redshift, Google BigQuery, Snowflake.

Data Mart

A smaller, focused subset of a data warehouse that serves the analytical needs of a specific business department or function. A large company may have separate data marts for Sales, Finance, and HR, each drawing data from the main enterprise data warehouse.

The ETL Process

EXTRACT

Pull data from multiple heterogeneous source systems — OLTP databases, spreadsheets, web APIs, flat files, and external data feeds. Data may be in completely different formats and locations.

TRANSFORM

Clean, validate, and convert extracted data into a consistent, unified format. Resolve duplicates, fill missing values, standardise date formats, apply business rules, and map codes to descriptions.

LOAD

Load the fully transformed data into the data warehouse. Can be a full load (replace all data) or an incremental load (insert only records that are new or changed since the last ETL run).
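The three steps can be sketched as a toy pipeline in Python. The two source formats and the cleaning rules are assumptions chosen to show typical Transform work (case normalisation, date standardisation, type conversion, de-duplication):

```python
from datetime import datetime

# EXTRACT: rows pulled from two hypothetical sources with inconsistent formats.
source_a = [{"name": "Ram",  "date": "2025-01-05", "amount": "100"}]
source_b = [{"name": "sita", "date": "05/01/2025", "amount": 200},
            {"name": "Ram",  "date": "2025-01-05", "amount": "100"}]  # duplicate

def transform(rows):
    """TRANSFORM: standardise case, date format, and types; drop duplicates."""
    seen, clean = set(), []
    for r in rows:
        name = r["name"].title()
        raw = r["date"]
        fmt = "%d/%m/%Y" if "/" in raw else "%Y-%m-%d"
        date = datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        amount = float(r["amount"])
        key = (name, date, amount)
        if key not in seen:               # resolve duplicates
            seen.add(key)
            clean.append({"name": name, "date": date, "amount": amount})
    return clean

# LOAD: append the cleaned, unified rows into the warehouse.
warehouse = []
warehouse.extend(transform(source_a + source_b))
```

The duplicate "Ram" row is dropped and both dates end up in one canonical format — exactly the consolidation a warehouse depends on.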

OLTP vs OLAP

| Feature | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
|---|---|---|
| Purpose | Day-to-day operational business transactions | Complex analysis, reporting, and decision support |
| Data | Current, real-time operational data | Historical, aggregated data collected from OLTP systems over time |
| Operations | Frequent INSERT, UPDATE, DELETE (short transactions) | Mostly SELECT with complex GROUP BY, aggregations, and window functions |
| Database Size | Gigabytes to terabytes | Terabytes to petabytes |
| Concurrent Users | Thousands of end users (clerks, customers) simultaneously | Small number of analysts, managers, and data scientists |
| Response Time | Milliseconds — short, fast transactions required | Seconds to minutes — long, complex queries expected |
| Normalisation | Highly normalised (3NF) to eliminate redundancy and ensure integrity | Denormalised (star or snowflake schema) to maximise query read performance |
| Examples | Bank ATM withdrawals, e-commerce order placement, hospital billing | Business intelligence dashboards, sales trend forecasting, executive reports |
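The contrast can be sketched with sqlite3: short OLTP-style writes feeding a single OLAP-style aggregation. The Sales table and figures are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (region TEXT, amount REAL)")

# OLTP side: many short write transactions recording individual sales.
conn.executemany("INSERT INTO Sales VALUES (?, ?)",
                 [("East", 100), ("East", 250), ("West", 400)])

# OLAP side: one read-only analytical query aggregating the history.
totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM Sales GROUP BY region"))
```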

Data Mining & Big Data

Discovering patterns and managing massive datasets

Data Mining

The automated process of discovering hidden patterns, correlations, anomalies, and useful insights from large datasets using statistical, mathematical, and machine learning techniques. Also called Knowledge Discovery in Databases (KDD). Example: A bank mining millions of transactions to automatically detect fraudulent activity patterns.

Data Mining Techniques

Classification

Assigns data items to predefined categories (classes). A model is trained on labelled historical data, then used to classify new, unseen items. The most common supervised learning task in data mining.

eg: Classifying emails as SPAM or NOT SPAM; diagnosing a patient as having diabetes or not based on test results.
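A deliberately tiny nearest-mean (nearest-centroid) classifier illustrates the supervised idea: learn one number per class from labelled data, then assign new items to the closest class. The glucose readings and thresholds below are invented:

```python
# Training data: (glucose_level, label) -- a toy labelled dataset.
train = [(90, "healthy"), (95, "healthy"), (100, "healthy"),
         (160, "diabetic"), (170, "diabetic"), (180, "diabetic")]

# "Train": compute the mean glucose level for each class.
centroids = {}
for label in {lab for _, lab in train}:
    values = [x for x, lab in train if lab == label]
    centroids[label] = sum(values) / len(values)

def classify(x):
    """Predict by picking the class whose centroid is nearest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))
```

A new reading of 100 lands near the "healthy" mean (95), while 165 lands near the "diabetic" mean (170).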

Clustering

Groups similar data items together without any predefined labels (unsupervised learning). Items within a cluster are more similar to each other than to items in other clusters. No training labels required.

eg: Segmenting customers by purchasing behaviour to target marketing; grouping news articles by topic automatically.
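The idea can be sketched with a toy one-dimensional k-means: repeatedly assign points to the nearest centre, then move each centre to the mean of its cluster. The spend figures and initial centres are invented:

```python
def kmeans_1d(points, centers, iters=10):
    """Toy 1-D k-means: no labels, just repeated assign-and-average."""
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:                       # assign to nearest centre
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        centers = [sum(ps) / len(ps)           # move centres to cluster means
                   for ps in clusters.values() if ps]
    return sorted(centers)

# Monthly customer spend -- two natural groups (low vs high spenders)
# emerge without any labels being given.
spend = [10, 12, 11, 200, 210, 190]
centers = kmeans_1d(spend, centers=[0, 100])
```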

Regression

Predicts a continuous numerical value based on one or more input variables. Finds the mathematical relationship (linear or non-linear) between independent and dependent variables.

eg: Predicting a house price from its size, location, and age; forecasting next quarter's sales revenue from historical trends.
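Simple linear regression can be computed directly from the least-squares formulas. The house-size/price numbers below are invented and perfectly linear, so the fitted line is exact:

```python
# Fit y = a*x + b by ordinary least squares.
xs = [50, 70, 90, 110]     # house size in square metres (toy data)
ys = [100, 140, 180, 220]  # price in lakhs (exactly y = 2x here)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope: covariance(x, y) / variance(x); intercept from the means.
a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - a * mean_x

def predict(x):
    """Predict a continuous value for a new input."""
    return a * x + b
```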

Association Rule Mining

Discovers co-occurrence relationships between variables in large transaction datasets. Produces rules of the form "if customers buy X, they also tend to buy Y." Also called market basket analysis.

eg: The classic finding: customers who buy diapers often also buy beer on Friday evenings. Used by supermarkets for product placement and promotions.
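Counting pair co-occurrence and rule confidence can be sketched in a few lines of Python; the baskets are invented, echoing the diapers-and-beer example:

```python
from itertools import combinations
from collections import Counter

# Market-basket data: each set is one checkout transaction.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"bread", "milk"},
]

# Support counting: how often is each item pair bought together?
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Confidence of the rule "diapers -> beer":
#   P(beer | diapers) = count(diapers AND beer) / count(diapers)
diaper_baskets = sum(1 for b in baskets if "diapers" in b)
confidence = pair_counts[("beer", "diapers")] / diaper_baskets
```

Here diapers and beer co-occur in 2 of the 3 diaper baskets, giving the rule a confidence of about 0.67.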

Big Data — The 5 Vs

Big Data refers to datasets so large, fast-moving, or varied that traditional database software cannot capture, store, manage, or analyse them within acceptable time. Characterised by five properties:

Volume

The sheer quantity of data generated. Facebook stores over 100 petabytes of photos. Every day, 2.5 quintillion bytes of data are created globally.

Velocity

The speed at which data is generated, collected, and must be processed. Stock exchanges execute millions of trades per second requiring real-time analysis.

Variety

Data arrives in structured (database tables), semi-structured (JSON, XML logs), and unstructured (images, video, social media posts, audio) formats.

Veracity

The trustworthiness and quality of data. Big Data is often noisy, incomplete, inconsistent, or biased. Poor veracity leads directly to incorrect analysis and bad decisions.

Value

The ultimate goal — extracting actionable business value from raw data. Enormous data volumes are worthless without the ability to turn them into useful insights.

Big Data Processing Tools

Apache Hadoop

An open-source framework for distributed storage (HDFS — Hadoop Distributed File System) and distributed batch processing (MapReduce) of very large datasets across clusters of commodity hardware. Developed by Doug Cutting at Yahoo in 2006, inspired by Google's GFS and MapReduce papers.
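The MapReduce model itself (not Hadoop's distributed implementation) can be sketched in plain Python as three phases over a toy word-count input:

```python
from collections import defaultdict

# Input split across "nodes" -- each line would live on a different machine.
lines = ["big data big insight", "data data everywhere"]

# MAP: every mapper independently emits (word, 1) pairs for its line.
mapped = [(word, 1) for line in lines for word in line.split()]

# SHUFFLE: group pairs by key so one reducer sees all values for one word.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# REDUCE: each reducer sums the values for its key.
word_counts = {word: sum(vals) for word, vals in groups.items()}
```

In real Hadoop, the map and reduce functions run in parallel across the cluster and the shuffle moves data over the network; the logic per phase is the same.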

Apache Spark

A fast, in-memory distributed data processing engine. Up to 100x faster than Hadoop MapReduce for iterative algorithms because it keeps intermediate data in RAM. Supports batch processing, real-time streaming (Spark Streaming), machine learning (MLlib), and SQL queries (Spark SQL) in one unified framework.

📝

Analytical Questions

Long-answer model answers for exam preparation

📋

Unit 4 Summary

Core topics and important exam questions

Core Topics Covered

Data, information, database, DBMS definitions

File system vs DBMS comparison

DBMS advantages and key concepts (PK, FK, tuple)

3-level architecture (external, conceptual, internal)

Database models: hierarchical, network, relational, OO

SQL sub-languages: DDL, DML, DCL, TCL

NoSQL types and SQL vs NoSQL comparison

ETL process, OLTP vs OLAP

Data mining techniques

Big Data — 5 Vs

Important Exam Questions

What is DBMS? Compare file system and DBMS. (10 marks)

Explain the relational model. Write SQL to select students with GPA > 3. (10 marks)

Explain the types of database models. (5 marks)

What is data mining? Explain three techniques. (10 marks)

Differentiate between OLTP and OLAP. (5 marks)

Explain the 5 Vs of Big Data. (5 marks)

Syllabus Coverage Checklist

Introduction to data, database, DBMS

Database system architecture

Database models (hierarchical, network, relational)

Database applications

SQL and NoSQL concepts

Introduction to data warehousing

Data mining

Concept of big data

🧠

How to Remember This Unit

Mnemonics and memory anchors for DBMS

Keys mnemonic: "PC Fans Can"

P=Primary key (unique row identifier) · C=Candidate key (eligible to be primary) · F=Foreign key (links to another table) · C=Composite key (multiple columns combined)

SQL categories: "DDL Defines, DML Manipulates, DCL Controls"

DDL → CREATE / ALTER / DROP (structure) · DML → SELECT / INSERT / UPDATE / DELETE (data) · DCL → GRANT / REVOKE (permissions)

NoSQL types: "Dogs Keep Cats Grounded"

D=Document (MongoDB) · K=Key-Value (Redis) · C=Column (Cassandra) · G=Graph (Neo4j)

ETL order: "Every Tiger Leaps"

E=Extract from source systems · T=Transform and clean the data · L=Load into the warehouse — always this exact order, never reversed

Big Data 5 Vs: "Very Varied Velocity Values Verified"

V=Volume · V=Variety · V=Velocity · V=Value · V=Veracity — five Vs, five fingers on one hand

OLTP vs OLAP memory hook

TP = Transaction Processing (daily Operations, fast writes) · AP = Analytical Processing (historical Analysis, slow complex reads)

Unit 4 Quiz — Database Management

1. Which key uniquely identifies each record in a database table?

2. Which of the following is a Data Definition Language (DDL) command in SQL?

3. The relational database model organises data into:

4. ETL in data warehousing stands for:

5. Which of the following is a NoSQL database?

BCAStudyHub

Your complete interactive study guide for TU BCA Semester I — covering all subjects with interactive tools, past papers, and exam prep.

TU BCA · Semester I

Program Info

- University: Tribhuvan University
- Program: BCA — Bachelor in Computer Application
- Semester: I (First)
- Subjects: 5 (4 live, 1 coming soon)

Made by SawnN