Developing Database Systems

Week 1, lecture 1
Week 1, lecture 2

Week 2, lecture 1 – Entity Relationship Diagrams

Week 2, lecture 2
Week 3, lecture 1
Week 3, lecture 2 – Relational Algebra
Week 4, lecture 1 – Data Flow Diagrams
Week 4, lecture 2
Week 5, lecture 1
Week 5, lecture 2

Week 1, lecture 1

What is a system?

It includes a number of things that are organised and collected to achieve a goal, perhaps with subsystems.

Information Systems: they process information in a predefined timely and accurate manner to usefully support some activity, e.g. for management.

Businesses are uncertain (probabilistic) but some systems can be predictable as people can be routine in their behaviour for example.
Deterministic systems are operated according to a predefined set of rules. behaviour is predicted. We need to know the inputs AND current state of the system, e.g. turning the lights on when they are already on makes no sense!

Control Systems: E.g. a thermostat. It checks on the heating by taking the temperature and action is carried out based on the checks. “Closed loop system with negative feedback”.
The process could be a construction of a piece of software. Feedback could come from management perhaps based on financial budgeting.

Historical Reports: this lists what happens in the past, e.g. September 1999 sales figures
Exception Reports: It lists exceptions, e.g. listing in a library system everyone with out standing fines.
Forecast: Projections for the future, e.g. stock levels if sales stay the same in such a month.
Information should be relevant, timely and accurate.

Software Development Lifecycles

Water Fall

Business survey/initial study: like a mini feasibility study – filtering out the unworkable ideas
Feasibility: is the idea worth doing? Money wise? Practical? Organisational – can the organisation adopt a new computer system? If it is not feasible, the project will not continue.
Systems Analysis: The current environment investigation sees what needs to be in the new system and identifies problems with it. By this, one creates the requirement specification. It is written in user orientated language for the customer to understand.
The specification is as fixed as possible because later changes can be hard and/or costly.
System Design: decisions are made about the hardware and software and what language(s) the system will use and how it will be implemented. Making these decisions too early can limit the analysis (the what).
Testing: link testing – make sure the modules of the programs fit together.
systems testing – we check the system against what we said we would do
user acceptance testing – handing the system over to the user for them to try it. They tend to view the system in terms of practicality.
operations acceptance testing – operations trying to use the system. They might be interested in what happens when a program crashes.
parallel run – run both the old and new system together and compare the differences. There needs to be some additional software written to compare them.
Implementation: setting up the new system. Bugs can be discovered and users can be using the system incorrectly. Problems can be unexpected.
Post-implementation Review: 6 months or so later, one sees how well the project has gone.
Maintenance: bug fixing, user requirements can change, external hardware and software changes, legislation changes such as adopting the euro and freedom of information.
Obsolescence: the system is no heavily patched that it needs rewritten and replaced.

Features throughout this system:

  • documentation is created
  • inputs and outputs at each stage
  • planning
  • deliverables
  • estimation of costs
  • Validation and Verification (V&V;) – validation – ensuring that the output from each stage meets the user requirements. Verification – ensuring that the output reflects the inputs, e.g. program testing.

Are we building the right product and the product right?

There is an attempt to keep costs down in the early stages. We do not move onto the next stage if the current one is not complete to avoid feedback. We can trace it back through the life cycle. e.g. planned, written, tested of one project element.

Advantages of the Waterfall model

  • structured management style
  • easier to organise the project
  • good for when the requirements are rigid.
  • ideal for mission critical and safety related projects

Disadvantages of the Waterfall model

  • difficult to fix errors discovered in the programming stage compared to the system analysis
  • produces a large amount of paper deliverables
  • when it gets too expensive to fix errors, they are ignored.
  • Users might not get what they wanted
  • Can a user define what they want?
  • In testing, it is the first time the user sees the system. They might not like it!!

Other Models

Idea -> Feasibility -> System Design -> Implementation -> Go Live -> Maintenance
Rapid application development, prototyping, evolution such as spiral

The waterfall model is a good roadmap to development and in aiding management control. Sometimes very small or one off projects use no software process model. Models vary so one is better for one job than it is for another. It is important to pick the right one for the job.

Rapid application development: it is quicker and cheaper to use. The user probably gets a system that meets their requirements more. Maintenance can be hard as documentation is minimal.

Using bits of different models for different parts of the project. Perhaps one lifecycle model for the database, another for the web interface,

Methodology consists of:

  • stages
  • techniques
  • an underlining philosophy
  • and arguably, training and tools

Database definition:

A common pool of self-describing, integrated, interrelated, shared data where each item of data is stored once only and is used in a wide range of applications.

DBMS (database management system):

A software system that handles all access between application programs and one or more database facilitating data:

  • maintenance
  • CRUD (create, retrieve, update, delete)
  • control

Week 1, lecture 2

In databases, columns are called "attributes"

The Relational Model of Data

This is based on two concepts: set theory and logic.
A set is a collection of things you are talking about: A = {1,2,8,13,7}
A might have a name such as employee numbers. It could be a collection of
fruit types, colours, etc Colours would be a string Numbers would be an integer.

Properties of sets

  • No duplicates, e.g. no two called yellow, or 3
  • No order in the sets so 1,2,3 is the same as 2,1,3

Operations on sets

Union – U – {1,2.3} ∪ {2,3,5,7} = {1,2,3,5,7}

Intersection – ∩ – {1,2,3} ∩ {2,3,4,5,6} = {2,3}
This is a list of what is in both sets.
{1} ∩ {2,3} – there are no intersections. this is called an empty set. {}

Difference – {1,2,3} – {3,4,5}

A – B = {1,2} – what elements are in A but not in B.
{4,5,6} – {4,5,8} = {6}
{2,3} – {2,3,4,5} = {}

Sub sets – B ⊂ A – B is a subset of A

E.g. {1,2,3} ⊂ {3,4,2,1,7} – TRUE
{w,e } ⊂ {W,e,n} – FALSE
On a graph, you can have coordinates of (2,3) and (3,2). They are called an
ordered pair. They are not the same though even though they contain the same numbers.

Cartesian Product of sets

A = {1,2,3}   B = {p,q}
A x B = Cartesian product
{(1,p), (1,q), (2,p), (2,q), (3,p), (3,q)}
The order of the pairs in the brackets can be in any order but the X and Y or A and B in this case, has to be A first then B.
B x A = {(p,1), (p,2), (p,3), (q,1), (q,2), (q,3)}
Cardinality – counts the elements of the set. The cardinality of B is 2
and for A, it is 3. The cardinality of A and B is 6.
card(A) = 3

Tuple – it is a record or row and are also sets like this: {1,2,3}.
This is a 3-tuple.
Relation – a loose definition of a relation is a subset of a Cartesian product
Relation R = {10, ‘A’,’X’}, (20,’R’,’Y’), (30,’A’,’Z’)}

There are subsets that are not true. There might not be any accounting
department in Dallas though it is in New York. Dallas might only have HR there.
In this way, not all the possible combinations are actually factual.

deptno dname loc
10 A x
20 R y
30 A z


SQL> desc emp;
It describes the attributes of the table “emp”.

varchar2 (10) – number up to 10 numbers long
char (4) – 4 digits long but can be a minus.
number (3,2) – 3 is a precision value; 2 is a scale
value, e.g. 305.45

DON’T STORE NULLS IN DATABASES!!! Nulls mean an entry is not allowed. E.g. the president has no manager.

Data types are sets. The types put constraints on the data.

To insert: INSERT INTO dept VALUES(50,’manager’,New York’);

Primary keys are unique so cannot be repeated elsewhere in the database but
some tables do not have primary keys. It is not automatically created when
building a table and needs to be manually set.

Insert statement is effectively: {….} ∪ {50,’manager’,’New York’}

DELETE from dept WHERE deptno = 60;
{1,2,3} – {3}  = {1,2}
You cannot always delete it though because of foreign keys. This is because there for example,
people still working at department number 10, so it can not be removed. This is
because of referential integrity constraints.
If you had a department where no one worked in it, the it can be deleted.

A relational model is:

  1. A collection of data types
  2. a collection of operators (+, -, *, powers, square root, etc)
    Varchar2: substring(2,3,’GLADYS’) = ‘LAD’
    DATE “11-JAN-05” +1 would add on an extra date.
    Databases can contain binaries (files).
    SELECT empno, ename FROM emp WHERE sal > 2000;
    Select from the emp table the column called empno and ename where sal (salary is more than 2000.
    SELECT … WHERE deptno = empno; – This would return “no rows returned” – the empty set {}.
    It is silly to compare department number with employee number!
  3. a collection of integrity rules, e.g. primary keys (cant have duplicate

Relations are a set of tuples
Predicate – each tuple has an associated predicate. Predicate reduces to an individual
propositions for each tuple. The cardinality is the number of tuples (elements in a set).
An attribute is a column.


type = named set of values. Each has an associated types with well defined set of operations.
Constraints limit the value entered into the database.

Relational Schema

student (student ID, surname, initials, degree)
Tuples are unordered. Attributes are unordered.

Relational Data Integrity

Date not before 1960, wages not over �50,000, salaries are positive, description is lower case.

Week 2, lecture 1 – Entity Relationship Diagrams

Database Design consists of: Conceptual, logical and physical.

Conceptual – high level implementation free design. It is an overview. This design is done using an entity relationship diagram. (ERD).
Logical – taking the conceptual design and mapping it onto a database.
Physical – optimising the database, memory, database speed. A slow hard drive can hold back the overall access speed of a database. It is essential to have a fast hard drive when a lot of users access the database at one time.

Systems Analysis

(Current environment investigation)
This is finding out about the existing system through: interviews, using the system, reading any documentation.

Guidelines: be objective, valid – the facts need to be checked to ensure they are correct,
always follow through reports when they are handed onto other people to see if they get used, diplomacy is required,
always start at the top of an organisation or department because it can give you better access to information if a
manager has authorised people to discuss the system requirements.

It is useful to look at reports, forms, etc, and see how they are filled in and if the whole forms are completed or if there are sections missed out or bits added to the bottom. This checks to see if the forms still meet current requirements.
Observation – one has to bear in mind that people can behave differently if they are being observed.
Questionnaires – it is a good way of getting the same information from a lot of people and if they are spread out geographically. it is hard to validate the responses and has a poor return rate.
statistical sampling – predict about the population and take sample of the population
Interviewing – good listening skills are required and keeping to the subject. Questions are written before the interview and the interviewer goes to the interviewee. First impressions are important.
It is good practice to write up a summary of the interview and send it to the interviewee to check and confirm.

Entity Relationship Diagrams

It models the data and the relationships between the data. It models logically. The ERD tries to understand which information is needed to be held and their relationships.


they can be anything of significance about which information needs to be known or held. Any named things usually are nouns such as person, rank, subject, etc.
entity types – car, garage, – this is not the car or garage name.
The car registration is the identifier for the car, the entity type.

Categories of entity types:
tangible – person, building, computer, object
active – accident, customer complaint,
conceptual – cost centre (an area where there are costs/expenditures come from and are identified)
A document is not necessarily an entity. A document needs its own data. an invoice or contract has a date, number, total.
A student card for example just represents a student.


They: identify, classify, quantify and qualify the state of an entity.
Occurrence information is useful to get an idea of size and amount of the data.

Derived attributes are to be avoided as they are reliant on another set of attributes, e.g. age – employee’s age comes from the DOB.
Artificial attributes, e.g. National Insurance number, Student ID no. These are all unique


They are an association between two entities

Garage can hire out more than one car.

Types of Relationships

  • 1:1 – a one to one relationship
  • 1:M – one to many relationship
  • N:M – many to many relationship

This is called the cardinality of the relationship.

But an invoice could have more than one payment if done in instalments or if paying for more than one thing with the same invoice or having more than one invoice. It could work however, if the payment is in a supermarket and the invoice is the till receipt.


WHERE Clause

It looks at every row to see if it is true or false:
SELECT * FROM emp WHERE sal = 2000;
SELECT * FROM emp WHERE sal = comm; – This is false as no results are returned. comm is another table
SELECT * FROM emp WHERE sal = sal; True – all the entries for sal are returned.

NVL command

This makes the null display as something else, e.g.
SELECT NVL(comm, 0) FROM emp;
The result will display as 0.
NVL (column_name, non-null value) The non-null value must match the column data type.
To return the nulls in use from a column,
SELECT * FROM emp WHERE comm is null;
NOTE: In Oracle, use IS NULL rather than = NULL


SELECT avg(sal) FROM emp;
This gives the average salary.

Week 2, lecture 2


It can’t be directly implemented in a computer system.

A student can attend more than one course. A course can have many students.
There is a trick to get around this:

One should bring the crow’s feet to the middle.
The identifying attribute in enrolment can be a mixture of student and course id to make an enrolment id.
Adding a date of enrolment helps if someone fails and retakes.

By breaking the N:M, the system can be implemented and represents reality better.
Jobsheet could be called manjoblink for the entity links but it is used if one cannot think of a good name for it.
It is not ideal to only have the identifying attribute for the entity and with no other attributes.
We aim to have more than the ID in the link.

Mandatory relationship: when each entity occurrence must participate in the relationship and is shown by a solid line.
Optional relationship: when each entity occurrence may exist without participating in a relationship and is shown by a dotted line.

Exclusive Relationships

An arch shows mutable exclusively. It means either a property has rental or mortgage payment:

There might be a third category if there mortgage has been paid off. However, we use the dotted lines instead.
The dotted lines give the option of neither rent payment or mortgage payment so mortgage payment becomes neither
options if it has been paid:

What is a relation?

A relation is a set of tuples. A tuple are attribute values from a set.
dept(deptno: deptnotype. dename: denametype, loc: loctype) Type is the set
operators on these values.
heading is the another name for the relational schema.

SQL – Create table statements

    deptno number(2)
    constraint pk_deptno primary key,
    dname varchar2(20)
    constraint upp_check
    check (dname = upper(dname)),
    loc varchar2(150);

  &nnbsp; as
SELECT deptno, dname
FROM dept;

This removes the loc column. the constraints from the previous table are not
carried over. it is a derived table.

Relational Data Integrity

Candidate keys – it is a set of attributes that cannot have the same values
in two different tuples.
An identifier is a candidate key as all the values will be different.
A date of birth is not one as there might be more than person with the same DOB.
all relations require a candidate key. Relations can have more than one
candidate key.
The primary key is unique and one is defined in the tuple.
Integrity constraints – this could be that employee number is 4 digits

Foreign Keys

A set of attributes, K of a relation R is a candidate key of R if:

  • no two distinct tuples R have the same value for K (unique)
  • no proper subset of K has the same uniqueness property (irreducible)

Course: courseno is a primary key
Student: studentno is a primary key

In enrol, it has courseno and studentno and grade, etc

Enrol: courseno studentno grade  

This shows a M:N relationship but goes in the enrol table.

c1 s1 A
c1 s2 B
c2 s1 C

Course: courseno cname  

Courseno and studentno are candidate keys. A student
shouldn’t take the same course twice.

c1 biology
c1 maths
c3 cooking

Student: studentno sname  

The student and course has to exist to go into the
enrol table.

s1 Fred
s2 Mary
s3 Brian

Foreign key – any value put into enrol table has to exist in both the student
and course tables first.

Foreign keys do not have to have anything to do with a primary key. It is
related to a candidate key. There is not always a primary key in another table
but usually.

Let R2 be a base relation. A foreign key in R2 is a subset of the attributes
of R2, say FK, (deptno) such that there exists a base relation R1 (dept) (where
R1 could be equal to R2) with a candidate key CK and for all time,
each value of FK in the current value of R2 is equal to the value of CK in some
tuple of R1

R2( …, …, FK, …

                R1(…, …, …, CK, …;);););)

Both FK and CK must also be defined using the same type and are linked
together as a relationship.

Let R1 = Emp and R2 = Dept. A foreign key (FK) in the emp table/relation is {deptno}.
As we require all employees to belong to
departments that exist then we need to insist that any value given to deptno
on the emp table (or emp.deptno) must be equal to a current value of
dept in the dept table (or dept.deptno)


Cascade – if you delete a department you can delete the people in it
as well. First move the people into another department to avoid deleting the
people. Oracle by default does not allow cascade (restrict). Cascade can change
the same values in other tables.

You cannot have nulls in a foreign key. Say if someone does not work in a
department however! You can get around this by adding "is null" in SQL
statements to get around this so nulls will appear in query results.

Nullify – This is a new feature of Oracle. If you delete department
number 10, it just removes 10 out of the emp table and replaces it with a null
as the people still remain and are not deleted.

Week 3, lecture 1

Redundant relationships

Relationship between the same type of entities.


If there is only one occurrence of an entity, then its very likely that it isn’t n entity, e.g. head office. This will not be modelled.

Week 3, lecture 2 – Relational Algebra

The Relational Algebra is a formal language where the objects of interest are relations and the operators
allow us to manipulate and retrieve information from these relations.
Although it has not been implemented in a commercial environment it has become the benchmark for other data retrieval languages.
It can be proved that the Relational Algebra is ‘complete’ in that any question can, ultimately, be answered by only
using the operators of the language. The Relational Calculus is a logically equivalent language and
it this that formed the basis of SQL. The original algebra had the following operators:

  • Select (Restrict)
  • Project
  • Product
    • Theta Join
    • Equijoin
    • Natural Join
    • Semijoin
    • Outerjoin
  • Difference
  • Union
    • Intersect
    • Divide

The five operators identified by ν are the primitive operators whilst the additional
three operators identified by μ are ‘three useful additions’.
Newer versions of the Relational Algebra have acquired a range of additional operators,
three of which we shall look at briefly, namely Extend, Summarize and Rename.

Some Sample Relations/Tables

STUDENT Stuid Stuname Major Credits
  S1015 Jones, Mary Maths 42
  S1005 Lee, Perry History 3
  S1001 Smith, Tom History 90
  S1010 Burns, Edward Art 63
  S1002 Chin, Ann Maths 36
  S1013 McCarthy, Owen Maths 0
  S1020 Rivera, Jane CSC 15

CLASS Course# Facid Sched Room
  ART103A F101 MWF9 H221
  HST205A F115 MWF11 H221
  CSC201A F105 TuThF10 M110
  MTH101B F110 MTuTh9 H225
  CSC203A F105 MThF12 M110
  MTH103C F110 MWF11 H225

FACULTY Facid Facname Dept Rank
  F101 Adams Art Professor
  F115 Smith History Associate
  F105 Tanaka CSC Instructor
  F110 Byrne Maths Assistant
  F221 Smith CSC Professor

ENROLLMENT Course# Stuid Grade
  ART103A S1001 A
  CSC201A S1020 B
  CSC201A S1002 F
  ART103A S1010  
  ART103A S1002 D
  MTH101B S1020 A
  HST205A S1001 C
  MTH103C S1010  
  MTH103C S1002 B

The associated schema are given below. The primary keys are underlined and the foreign keys are included in [ ] brackets.

Student( Stuid, Stuname, major, credits)
Course( Course#, [Facid], Schedule, Room)
Faculty( Facid, Facname, Dept, Rank)
Enrollment([ Course#], [Stuid], Grade)


This operator acts on a single table (well, strictly a relation …) and returns rows (tuples) from that table (relation)
E.g. new_table := SELECT table_name WHERE condition

T1 := SELECT Student WHERE Stuid = ‘S1001’
∴ T1 is a table with one row …

Stuid Stuname Major Credits
S1001 Smith, Tom History 90

T2 := SELECT Student WHERE Major = ‘Maths’
∴ T2 is a table with 3 rows…

Stuid Stuname Major Credits
S1015 Jones, Mary Maths 42
S1002 Chin, Ann Maths 36
S1013 McCarthy, Owen Maths 0

Alternative Notations (The case Is IrrEleVAnt)

T1 := Select STUDENT Where STUID = ‘S1001’
(the one we use)
T1 := STUDENT Where STUID = ‘S1001’
Select STUDENT Where STUID = ‘S1001’ Giving T1
T1 :=σSTUID=’S1010’(STUDENT)


This also acts on a single table and returns a table.
T1 := PROJECT table_name OVER column_list
E.g. R1 := PROJECT Class OVER Facid, Room
R1 is a table with two columns. Repeated rows DO NOT appear.

Facid Room
F101 H221
F115 H221
F105 M110
F110 H225

(c.f. Select job from emp; in SQL)

Alternative notations (N.B. The case Is IrrEleVAnt)

T1 : = CLASS[Facid, Room] Project CLASS Over Facid, Room Giving T1


(very similar to the Cartesian Product of sets)
New_Table := Table1 x Table2

Alternative notations (N.B. The case Is IrrEleVAnt)

T1 := STUDENT x ENROLLMENT   (we shall mainly use this one)

A       B  
Nu. Age Sex   Id. Col.
10 43 M   1001 Red
11 21 F   1002 Yellow
12 80 F   1003 Red
        1004 Green

T := A TIMES B (or A PRODUCT B or A x B)

Nu Age Sex Id. Col.
10 43 M 1001 Red
10 43 M 1002 Yellow
10 43 M 1003 Red
10 43 M 1004 Green
11 21 F 1001 Red
11 21 F 1002 Yellow
11 21 F 1003 Red
11 21 F 1004 Green
12 80 F 1001 Red
12 80 F 1002 Yellow
12 80 F 1003 Red
12 80 F 1004 Green


An example is
T1 := A x B Where Condition
(e.g. Condition is something like Deptno > 30)
(i.e. simply a product with a Where condition for selecting rows

Thus the above could be written as
T0 := A x B
T1 := Select T0 Where Condition


This is a form of Theta Join where the Condition involves the equality of common columns. E.g.
































T1 := A Equijoin B Over z (the Over z is optional: see later)
Thus T1 is:





















This is equivalent to
T0 := A x B
T1 := Select T0 Where A.z = B.z


This is the same as an Equijoin with the removal of the duplicated column.
Thus, using A and B from above then T1 := A Natural Join B would give

















Because the Natural Join is such a common operation we use the notation:
T := JOIN A, B OVER columns
where the OVER columns is optional: why?)
(Or sometimes T := A xcolumns B)

Week 4, lecture 1 – Data Flow Diagrams

They are used to model processes. It breaks the system down into smaller
understandable processes/ The activity in the process effects data. Below is a
process box:

Data goes in and out of the box. No data is gained or lost in a process. Data
Flow diagrams show transformation of data so a report would not go onto the
diagram unless it is fundamentally essential to the understanding of the DFD.
Using the same data flow name elsewhere on the diagram means that its the same

Data store


n is a number. x can be one of 4 values:
Dn – digital
Mn – Manual such as a filing cabinet
Tn transitory file (digital) – temp file between two processes
Tn(m) – transitory file (manual) – like a box with files in it such as an out tray


In SSADM it is called an external entity. It looks like a squashed circle
with the name of it in the middle. It shows where data starts and finishes. It
can be seen as sources and destinations of the data.

Week 4, lecture 2

SET OPERATIONS (Tables/Relations must be Union Compatible)

Main Facid Facname Dept Rank
  F101 Adams Art Prof
  F105 Tanaka CSC Inst.
  F221 Smith CSC Prof
  F456 Poulson Hist Ass

Branch Facid Facname Dept Rank
  F101 Adams Art Prof
  F105 Tanaka CSC Inst.
  F110 Byrne Maths Ass.
  F221 Smith CSC Prof


(strictly the Outer Equi Join)
Consider these (highly dubious) tables/relations

STU        ENR  
Stuid Stuname   Course# Stuid
S1005 Perry   CSC201A S1020
S1001 Smith   CSC201A S1010
S1010 Burns   MTH103A S1001
S1020 Rivera   CSC203A S2222


Stu.Stuid Stuname Course# Enr.Stuid
S1005 Perry
S1001 Smith MTH103C S1001
S1010 Burns CSC201A S1010
S1020 Rivera CSC201A S1020
CSC203A S2222

i.e. this is the Equijoin together with all the rows that did not participate in this join.
The ‘missing’ entries are nulls.
We often (in fact, usually, need to evaluate the LEFT or RIGHT outer join.
Consider the tables/relations:

F     C  
Facid Facname   Course# Facid
1 BLOGGS   C1 1
2 SMITH   C2
3 JONES   C3 1
4 PHELAN   C4 3
      C5 4

Write down: 1) F OuterJoin C, 2) F Left OuterJoin C, 3) F Right OuterJoin C, 4) C Left OuterJoin F, 5) JOIN F, C
Answers can be deduced from:

F.Facid Facname Course# C.Facid  
1 BLOGGS C1 1  
1 BLOGGS C3 1  
3 JONES C4 3  
4 PHELAN C5 4  
2 SMITH Left
C2 Right


(a difficult operation to define informally)
Consider the tables/relations A and B defined below:

A       B












































X := A DIVIDEBY B (or A / B) is formed by choosing
the rows of A that have a match for ALL the rows of B and displaying the columns displayed are those in A and NOT in B.
Thus X is:

a b
q s
e f

N.B.   X x B is a subset of A
X x B =

a b   c   a b c
q s




q s 4
e f   5   q s 5
          e f 4
          e f 5

Some Additional Operators


If we have the schema EMP(empno, ename, sal, deptno),
where EMP is the relation name and empno, ename, sal and deptno are the attribute names, and an associated set of tuples then:
NewEmp := Emp RENAME sal AS monthly_sal
will produce the relation:
NewEmp (empno, ename, monthly_sal, deptno)
with the same set of tuples.

Consider the following problem based on the relation/table A given below:
find all the athletes who hold more than 1 world record.

Event Athlete
100 Fermi
200 Gauss
400 Fermi
800 Laplace

ONE possible SQL solution is:
Select    distinct a.athlete
From   A a, A b
Where   a.athlete = b.athlete
And     a.event != b.event;

(NB   Write down at least one alternative SQL solution)

Now consider a solution in Relational Algebra and ponder on why we need to use RENAME.
T1 := A RENAME Event as Event1
T2 := JOIN T1, A {OVER Athlete}
T3 := SELECT T2 WHERE Event > Event1


Consider the syntax:
T := EXTEND Emp ADD sal*12 AS Annsal

This is similar to the SQL:
Select empno, ename, sal, deptno, 12*sal “Annsal”
From Emp;

Now consider:
T := EXTEND Emp ADD sal*12 AS Remuneration
WHERE Remuneration > 3000;

(i.e. Extend the relation/table by adding a column called Remuneration and also restrict the tuples/rows to be displayed.)

NOTE: Rename is not a primitive operator. Consider:
T1 := EXTEND Dept ADD Loc AS Place
A := PROJECT T1 OVER Deptno, Dname, Place
The above is equivalent to:
A := Dept RENAME Loc as Place


The Relational Algebra as the usual ‘group’/’summary’ functions. E.g.
SUM( ), AVG( ), MAX( ), MIN ( ) and COUNT.

Consider the SQL query:
Select   d.deptno, dname, count(*) “No. Of Employees”
From   emp e, dept d
Where   e.deptno(+) = d.deptno
Group by d.deptno, dname;

Which will produce a display something like

Deptno Dname No. Of Employees
30 SALES 6

How can we achieve this in the Relational Algebra?

Firstly, consider a partial solution.
T := SUMMARIZE Emp BY (Deptno)
   ADD Count AS No. Of Employees

which will produce only the information for departments with more than zero employees.
How could you amend the above to produce the required result?

Week 5, lecture 1

Week 5, lecture 2 – Functional Dependencies

An attribute Y is functionally dependent on attribute X if, and only if, each possible value of X has associated with it
precisely one value of Y.
The notation used is X -> Y.
(N.B. X and Y could be sets of attributes.)
We use the notation X !->Y to state that "X does not determine Y".

Assumptions for discussion
Note, here YZ = Y ∪ Z.
If X -> Y and X -> Z then X -> YZ
If X -> YZ then X -> Y and X -> Z
If X -> Y and Y -> Z then X -> Z  (transitive)
If X -> Y then XA -> Y


1. In the emp table/relation you are using in the SQL tutorial we have:
{empno} -> {ename}, {empno} -> {job} and so {empno} -> {ename, job}.
(in fact {empno} -> {ename, job, hiredate, sal, comm, deptno})
Note also that {ename} !-> {empno}.  Why?

2. In the dept table/relation we have {deptno} -> {loc, dname}
This, again follows from the semantics of the underlying relation, i.e. that all departments
are to have a unique department number.

3. Consider the following table of values.

Stuid Course_Id Mark
S1010 C234 56
S1011 C469 77
S1023 C654 35
S5671 C111 77
S6560 C207 69

It is very important to note that we cannot infer that {Stuid} -> {Course_Id} as we have not been told anything about the semantics. In fact, if we make the assumption that a student can only have one recorded mark for
any course then we can now infer that {Stuid, Course_Id} -> {Mark}.  Some different data might help to confirm this notion.

Stuid Course_Id Mark
S1010 C234 56
S6560 C207 69
S6560 C208 33
S1010 C235 70

why can we
be sure that {Mark} !-> {Stuid}

4. Data about the cars driven by sales representatives are kept on file.  Discuss what the following
assumptions governing the underlying model actually mean:

  1. no two representatives with the same name live at the same address;
  2. no two car manufacturers produce models with the same name.

Some typical data for the calendar year 2003 is given below where N = Name,
A = Address, C = car, M = model and D is the distance covered.  You may also
assume that representatives keep the same car for the whole year.

Green 1, Park, Bristol Vauxhall Cavalier 31272
Patel 7, New Street, Brum Rover 216 25351
Green 4, The Place, Manchester Vauxhall Cavalier 28736
Brodie 9, The Hill, Glasgow Ford Escort 35162
Smith 7, The Exchange, Manchester Vauxhall Cavalier 31272
Smart 4, Euston, London Ford Granada 27529
Schwartz 4, The Place, Manchester. Vauxhall Cavalier 34172

Using the above scenario, coupled with the data

  1. Confirm that   C !-> M, (C does not determine M) A !->
    N, (There can be more than person living at the same address) D !-> N.
    (D doesn’t determine N because its possible that more than one sale
    representative can theoretically drive the same number of miles)
  2. Explain why M -> C. Cavalier has to be made by Vauxhall but Vauxhall can make more than one model.
  3. Explain why NA -> CMD (i.e. {N, A} -> {C, M, D}). NA is unique because no two representatives with the same name live
    at the same address and they are candidate key if N is combined with A.

5. A veterinary practice stores data about pets. The following shorthand is used to keep appropriate details. N
– name of pet, O – name of owner, B – type of animal, T – telephone number of owner, A – age of pet.

Part of a typical file of details can be represented by the following ‘table’ :

Spot Dog 4 A.Brown 6178
Fred Dog 6 A.Brown 6178
Ming Cat 3 T.Nugent 5246
Fred Dog 2 T.Nugent 5246
Doris Gerbil 2 T.Nugent 5246

Some of the underlying assumptions are:

  • each pet is a unique type of animal and has one (and only one) owner;
  • no single owner has two pets with the same name;
  • no two owners share the same telephone number;
  • there may be owners with the same name.

a) Using the above scenario, coupled with the data, explain why:

  1. N !-> T, (no, there are two Freds)  T !->N, OA !-> N (Note OA means {O, A}) look at row 4 and 5.
    These animals are the same age.

  2. >NT -> BAO – The telephone no has different names per animal.

b) Insert a row of date to verify that NBAO !-> T

Spot Dog 4 A.Brown 1111

NBAO cannot be a key so therefore no subset can be a key, such as BNO (a subset).


We have already had a brief look at the meaning of Functional Dependencies (FDs) and some of the language required to understand the
background to Relational Databases.  The process of normalisation is one
approach to the process of data modelling, i.e., the process of moulding data
into a form that can be used to build sound databases.  Normalisation can be made into a rigorous process although we
shall only be covering the formal aspects briefly. Normalisation
is not the answer to all data modelling problems but it is the foundation to all
aspects of understanding the relationships between data and minimising the
unnecessary repetition of data [i.e. avoiding data redundancy]. You have already
met an alternative approach called Entity Relationship Modelling. Both
approaches have their (sometimes passionate!) advocates. We shall not take
sides, indeed the overall methodology we advocate on this course is a convenient
compromise between the two.
>In the lectures so far, and indirectly in the SQL tutorials, we have been talking about
the semantics of relations, which roughly means what relationship(s) exist
between the attributes. These relationships have been expressed in terms of
functional dependencies (FDs).
Essentially normalisation involves decomposing relations into a number of
sub-relations that have certain ‘desirable properties’ which have yet to be
defined. Our approach is thus to assume that all the attributes relevant to an
application are known as is the set of FDs.  Determining the
relevant attributes, let alone the functional dependencies, is usually a highly
non-trivial task.

First Normal Form (1NF)

As long as the types of all the attributes in a relation are
scalar then the relation is in 1NF.
Historically 1NF was defined to disallow multi-valued
attributes, composite attributes, and their combinations (i.e. tuple-valued and
relation-valued attributes). 
Thus traditionally 1NF had to disallow:

  • an attribute having as set of values, a tuple of values, or a combination of both as an attribute for a
    single tuple
  • "relations within relations"

Note that these conditions are no longer viewed as a "problem" (and they never should
have been).


Consider a relation Dept(deptno, dname, loc) with meanings as in the SQL
tutorial and with the primary key underlined.

Consider the following scenario

Deptno dname Loc
10 ACCOUNTS {New York, Toronto, Dallas}
20 RESEARCH Boston

meaning of the above can be taken to be that Accounts is located in three cities
New York, Toronto and Dallas. This way of viewing the data is not (currently)
allowed and the relation is not in 1NF. One way of resolving the problem would
be to view that data as:

Deptno Dname Loc
10 ACCOUNTS New York
10 ACCOUNTS Toronto
10 ACCOUNTS Dallas
20 RESEARCH Boston
30 SALES Chicago


SALES San Francisco

Qu. What problems would such a view of the data pose?

A better solution is to remove the attribute causing the problem into another relation
together with the key of the of the original relation. The new relations are
Dept1(deptno, dname) and Deptloc(deptno, loc). 
The corresponding tables would then be:

Dept1     Deptloc  
deptno dname   Deptno loc


  10 New York


30 SALES   10 Dallas
      20 Boston
      30 Chicago
      30 San Francisco

Qu. Why might this view be better? Discuss. This is usually the preferred version of
1NF. (During revision ask yourself why it might not matter which version you
took. Consider a FD diagram)
Consider now a slight modification of the Emp table used in
the tutorial (no attempt has been made to use the same data but the semantics
are the same except for the addition of the FD   job -> sal.)  The new table now stores a job history as well and so the attributes
{job, hiredate, sal} are repeated many times for the same empno. We could write
the modified relation as
Emp (empno, ename, *(job, hiredate, sal)) 
where the *(   ) means that the attributes enclosed in the brackets after the * are
repeated. Essentially we have a relation within a relation and it can be dealt
with in the same way as the non-scalar values for loc in the pervious example.
One view might be:



1001 SMITH CLERK 01-JAN-78 1000
    SALESMAN 01-FEB-82 1200
    MANAGER 07-MAR-88 2100
1002 JONES TRAINER 01-JUL-88 50
    MANAGER 01-AUG-90 2100
1003 ZAK CLERK 05-OCT-82 1000

Qu. How might this be stored?

Making some assumptions about the answers arrived at to the above Qu. a solution
is to remove the ‘repeating fields’ job, hiredate and sal into another relation
together with empno.

The new 1NF relations are:
Emp1(empno, ename) and
Jobs(empno, job, hiredate, sal).

empno ename   empno job hiredate sal
1001 SMITH   1001 CLERK 01-JAN-78 1000
1002 JONES   1001 SALESMAN 01-FEB-82 1200
1003 ZAK   1001 MANAGER 07-MAR-88 2100
      1002 TRAINER 01-JUL-88


      1002 MANAGER 01-AUG-90 2100
      1003 CLERK 05-OCT-82 1000

Qu Why do we need empno in the relation Jobs?
Qu. What is the implication of now having job -> sal?