CS 411 Database Systems¶
ER Model¶
Entity / Relationship Model
Corresponds well to relational model.
entities¶
- represented by rectangles
attributes¶
- represented by ovals
- an entity has one or more attributes
- a relationship has zero or more attributes
relationships¶
- represented by diamonds
- connecting different entities
- one-one
- directional edge
- many-many
- undirected edge
roles¶
- when there are 2 or more edges connecting the same entity and relationship, annotate them with roles
subclasses¶
- use `IsA` triangle
- a subclass inherits all the attributes of the parent class
- no multiple inheritance, subclasses form a tree
constraints¶
- ensure consistency
- enable efficient storage, data lookup etc.
- e.g. knowing how much space to allocate for a key
keys¶
- the key set of an entity is represented by the underlined attributes
- a set of attributes to distinguish entities in an entity set
- every entity set must have a key
- key of a subclass = key of its root class
UML¶
Unified Modeling Language
- 0..1 = 0 to 1
- 0..* = any number
- e.g. in a Studios/Movies/Stars diagram, the multiplicities mean:
- "Studios" is associated with 0 or more "Movies"
- "Movies" is associated with 0 or 1 "Studios"
- "Movies" is associated with 0 or more "Stars"
- "Stars" is associated with 0 or more "Movies"
Physical Models¶
How to actually store data on machines
Pre-Relational Model¶
- Hierarchical Model
- tree
- Network Model
- extends Hierarchical Model from a tree to a graph
- can navigate from any node
- encode access paths
Relational Model¶
- no fixed connections between data
- access data through computing relations (e.g. join)
Post-Relational Model¶
To solve the impedance mismatch of relational model:
Application code uses object-oriented hierarchical classes, while the relational model is a bunch of independent records (tables).
- Object-Oriented
- Object-Relational Model
To deal with data in new settings:
Relational Model is created to model enterprise management scenarios, but more scenarios exist now.
- Document Model
- Key-Value Model
- Graph Model
Translating ER Model to Relational Model¶
Entity Set -> Relation¶
- entity -> table
- same key
Relationship -> Relation¶
- relationship of entities -> table
- fields: keys of the entities & its own attributes
- key = tuple of the keys of the entities involved
- if a many-to-one relationship exists, then the key of the "one"-side entity is not needed
- e.g. if each drinker has at most one favorite beer, `Beers`'s key won't be part of `Favorites`'s key
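The translation above can be sketched in SQLite via Python, assuming a hypothetical Drinkers/Beers/Favorites schema (names as keys, `Favorites` many-to-one from drinkers to beers):

```python
import sqlite3

# Each entity set becomes a table with the same key; the Favorites
# relationship table holds the keys of both entities, but because each
# drinker has at most one favorite beer (many-to-one), its key is just
# the drinker's key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Drinkers (name TEXT PRIMARY KEY, addr TEXT);
CREATE TABLE Beers    (name TEXT PRIMARY KEY, manf TEXT);
CREATE TABLE Favorites (
    drinker TEXT PRIMARY KEY REFERENCES Drinkers(name),  -- key of the "many" side only
    beer    TEXT REFERENCES Beers(name)
);
""")
conn.execute("INSERT INTO Drinkers VALUES ('Ada', '123 Main St')")
conn.execute("INSERT INTO Beers VALUES ('Bud', 'AB')")
conn.execute("INSERT INTO Favorites VALUES ('Ada', 'Bud')")
rows = conn.execute("SELECT drinker, beer FROM Favorites").fetchall()
print(rows)  # [('Ada', 'Bud')]
```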
Combining Many-One Relationships¶
Reduce to the minimum number of tables needed.
Translating Subclasses¶
- ER approach
- store the subclass as another table with the subclass-specific attributes as the fields
- object-oriented approach
- store the subclass as another table with the all the parent attributes + subclass-specific attributes as the fields
- null-value approach
- store everything in the same table with all the base attributes + subclass-specific attributes as the fields
- for records of the base class leave the subclass fields as null
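The three approaches can be sketched as DDL, here for a hypothetical Movies / Cartoons hierarchy where `voices` is the subclass-specific attribute:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- ER approach: subclass table holds the key + subclass-specific attributes only
CREATE TABLE Movies   (title TEXT PRIMARY KEY, year INT);
CREATE TABLE Cartoons (title TEXT PRIMARY KEY, voices TEXT);

-- OO approach: subclass table repeats all parent attributes
CREATE TABLE CartoonsOO (title TEXT PRIMARY KEY, year INT, voices TEXT);

-- NULL-value approach: one wide table, subclass fields NULL for plain movies
CREATE TABLE MoviesAll (title TEXT PRIMARY KEY, year INT, voices TEXT);
""")
conn.execute("INSERT INTO MoviesAll VALUES ('Casablanca', 1942, NULL)")
conn.execute("INSERT INTO MoviesAll VALUES ('Aladdin', 1992, 'Robin Williams')")
# with the ER approach, a full cartoon record needs a join on the key
conn.execute("INSERT INTO Movies VALUES ('Aladdin', 1992)")
conn.execute("INSERT INTO Cartoons VALUES ('Aladdin', 'Robin Williams')")
row = conn.execute("""
    SELECT m.title, m.year, c.voices
    FROM Movies m JOIN Cartoons c ON m.title = c.title
""").fetchone()
print(row)  # ('Aladdin', 1992, 'Robin Williams')
```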
Object-Base Modeling¶
Impedance Mismatch: OOP in application code vs. Relational Models in databases, object vs. tuple
- object identifiers: objects are referenced by pointers, tuples by values
- nesting: an object can contain nested objects while a tuple can only contain simple values
- methods: objects have methods while tuples only have attributes
- inheritance: objects can form hierarchy while tuples can't
- encapsulation: object can hide internal implementation while tuples can't
programming paradigms to solve the impedance mismatch
- object-oriented data model
- extend OOP languages with database functions
- object-relational data model
- extend RDBMS with object features
- e.g. PostgreSQL
NoSQL - dealing with data in various new scenarios¶
new requirements due to new settings
- very large data
- very complex data
Key-Value Model¶
- data model managed by application code (schema-on-read)
- value can be string, list, set, hashmap etc.
- e.g. Redis
Graph Model¶
- entities -> nodes; relationships -> edges
- relationship as a first class concept
- similar to network model
- e.g. Neo4j
Document Model¶
- collections of documents
- xml or json
- not necessarily have schemas (schema-on-read)
- can enforce schemas or validation rules if you want
Computing on Data¶
Data Computing Frameworks¶
- relational data
- relational algebra
- -> SQL
- Datalog
- relational algebra
- graph data
- a bunch of graph algos
- key-value data
- MapReduce
Relational Algebra¶
- relations -> new relation
- basic operators
- reduction
- selection \(\sigma\): reduce rows
- projection \(\pi\): reduce columns
- combination
- set operations
- e.g. union \(\cup\), difference \(-\)
- cartesian product \(\times\)
- each record of the new table = a record of one table + a record of another table
- 5-row 2-col table \(\times\) 6-row 3-col table -> 30-row 5-col new table
- renaming \(\rho\)
- reduction
- cartesian product of tables -> reduce -> get useful data
- derived operators
- theta join \(\bowtie_\theta\) -> cartesian product filtered by \(\theta\) (selection)
- \(\theta\) is a condition
- natural join \(\bowtie\)
- combining rows of 2 tables with matching common attribute
- e.g. a regular SQL join on same id
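The operators above can be mimicked over relations represented as lists of dicts; the function names and toy data below are our own:

```python
# Toy relational-algebra operators over relations as lists of dicts.

def select(rel, pred):            # sigma: reduce rows
    return [t for t in rel if pred(t)]

def project(rel, attrs):          # pi: reduce columns (set semantics)
    seen, out = set(), []
    for t in rel:
        u = tuple((a, t[a]) for a in attrs)
        if u not in seen:
            seen.add(u)
            out.append(dict(u))
    return out

def product(r, s):                # cartesian product
    return [{**t, **u} for t in r for u in s]

def natural_join(r, s):           # combine rows matching on common attributes
    common = set(r[0]) & set(s[0]) if r and s else set()
    return [{**t, **u} for t in r for u in s
            if all(t[a] == u[a] for a in common)]

# a 5-row table x a 6-row table gives a 30-row product
rows5 = [{"a": i} for i in range(5)]
rows6 = [{"b": j} for j in range(6)]
print(len(product(rows5, rows6)))  # 30

sells  = [{"bar": "Joe's", "beer": "Bud", "price": 3}]
likes  = [{"drinker": "Ada", "beer": "Bud"}]
joined = natural_join(sells, likes)
print(joined)  # one row, matched on the common attribute "beer"
```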
MapReduce¶
- created at Google in 2004
- Apache Hadoop
- inspired by map & reduce in functional programming
- map M: apply function to each element of the input list
- group G: group into (key, a list of the values of the key)
- reduce R: apply function to the key-value pair
- e.g. word count
- M(k, v): for each word w in v -> (w, 1)
- R(k, values) -> (k, sum(values))
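The M / G / R pipeline above, sketched in plain Python on toy documents:

```python
from collections import defaultdict

def map_fn(key, value):
    # emit (word, 1) for every word in the document
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    return (key, sum(values))

docs = {"d1": "to be or not to be", "d2": "to do"}

# map: apply to each (key, value) input pair
pairs = [kv for k, v in docs.items() for kv in map_fn(k, v)]
# group: (key, list of that key's values)
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)
# reduce: apply to each grouped pair
counts = dict(reduce_fn(k, vs) for k, vs in groups.items())
print(counts)  # {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'do': 1}
```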
Designing Schemas¶
- to resolve redundancy, use normalized forms
Functional Dependencies¶
- constraints of the fields of a schema
- A functionally determines B: \(A_1, ..., A_m \rightarrow B_1, ..., B_n\), similar to \(f(A)=B\)
- means a many-to-one mapping similar to a function, not that B can be computed from A
- depends on the domain knowledge
- e.g.
- id -> birthday
- id, course -> grade
Keys¶
- key = a minimal set of attributes that uniquely determine all attributes
- superkey = a superset of a key
- e.g. {SSN} is a key while {SSN, BOD} is a superkey
Rules¶
Basic Rules: Armstrong's Axioms
\(A\rightarrow B\) = A functionally determines B
- reflexivity rule: if \(B\subseteq A\), then \(A\rightarrow B\)
- e.g. id, name -> name
- augmentation rule: if \(A\rightarrow B\), then \(AC\rightarrow BC\)
- e.g. id -> major; id, name -> major, name
- transitivity rule: if \(A\rightarrow B\) and \(B \rightarrow C\), then \(A\rightarrow C\)
Derived Rules
- splitting rule: if \(A\rightarrow BC\), then \(A\rightarrow B\) and \(A \rightarrow C\)
- pf
- given \(A\rightarrow BC\)
- \(BC \rightarrow B\) by reflexivity
- \(A \rightarrow B\) by transitivity
- similarly \(A \rightarrow C\) by transitivity
- e.g. id -> name, major; id -> name && id -> major
- combining rule: if \(A\rightarrow B\) and \(A\rightarrow C\), then \(A \rightarrow BC\)
- pf
- given \(A\rightarrow B\) and \(A\rightarrow C\)
- \(A \rightarrow AB\) and \(AB \rightarrow BC\) by augmentation
- \(A \rightarrow BC\) by transitivity
- e.g. id -> name && id -> major; id -> name, major
- e.g. given drinker, bar -> beer and bar, beer -> price, to prove drinker, bar -> price:
- drinker, bar -> bar by reflexivity
- drinker, bar -> bar, beer by combination
- drinker, bar -> price by transitivity
Attributes Closure¶
- the closure of a set of attributes = the set of attributes it can functionally determine, denoted by \(^+\)
- \(\{A_1, ..., A_n\}^+=\{B_1, ..., B_m\}\) denotes that the closure of A is B
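The closure can be computed with the standard fixpoint algorithm: keep adding the right side of any FD whose left side already lies inside the closure. A minimal sketch, using FDs like the id/course examples above:

```python
def closure(attrs, fds):
    """attrs: set of attribute names; fds: list of (lhs, rhs) attribute sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # lhs already determined -> rhs is determined too
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

fds = [({"id"}, {"name", "major"}), ({"id", "course"}, {"grade"})]
print(sorted(closure({"id", "course"}, fds)))
# ['course', 'grade', 'id', 'major', 'name']
```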
Normal Forms¶
- desired properties
- minimize redundancy
- avoid info loss
- preserve dependency
- ensure good query performance
- 1NF First Normal Form
- each attribute contains only atomic values (simple values; not lists, sets, etc.)
BCNF Boyce-Codd Normal Form¶
- Codd invented the relational model
- start of declarative data management (as opposed to procedural)
- bad FD: \(A\rightarrow B\) in relation R but A isn't a superkey
- e.g. id determines birthday but not course taken
- a relation R is in BCNF \(\iff\) whenever there's a nontrivial FD for R, \(A\rightarrow B\), A is a superkey for R
- \(A\rightarrow B\) is trivial if \(B\subseteq A\)
- i.e. if an attribute determines something in a row then it determines everything in the row
- i.e. no bad FDs
- BCNF Decomposition
- convert a relation not satisfying BCNF to one that does
- algo \(BCNF(R, F)\)
- if exists a bad FD \(A\rightarrow B\) in \(F\) (FDs of relation \(R(A, B, C)\))
- decompose \(R(A, B, C)\) into \(R_1(A, B)\) and \(R_2(A, C)\)
- compute FDs for \(R_1\) and \(R_2\) as \(F_1\) and \(F_2\)
- return \(BCNF(R_1, F_1)\cup BCNF(R_2, F_2)\)
- else return \(R\)
- no FDs in \(R_2\) -> no bad FDs -> \(R_2\) is in BCNF
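A simplified sketch of the decomposition algorithm above. Note the projection of FDs onto fragments is done naively here (keeping only FDs whose attributes all fall inside a fragment), which is not complete in general; a full algorithm projects via closures:

```python
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf(attrs, fds):
    attrs = set(attrs)
    for lhs, rhs in fds:
        # bad FD: nontrivial and lhs is not a superkey
        if not rhs <= lhs and closure(lhs, fds) != attrs:
            r1 = closure(lhs, fds) & attrs       # A union B
            r2 = lhs | (attrs - r1)              # A union C
            f1 = [(l, r) for l, r in fds if l | r <= r1]  # naive projection
            f2 = [(l, r) for l, r in fds if l | r <= r2]
            return bcnf(r1, f1) + bcnf(r2, f2)
    return [attrs]

# e.g. R(id, name, course, grade) with id -> name and id, course -> grade
fds = [({"id"}, {"name"}), ({"id", "course"}, {"grade"})]
fragments = bcnf({"id", "name", "course", "grade"}, fds)
print([sorted(f) for f in fragments])
# [['id', 'name'], ['course', 'grade', 'id']]
```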
Lossless Decomposition¶
- decomposition may be lossy
- decompose -> natural join -> got extra tuples, meaning some info was lost during decomposition
- BCNF is a lossless decomposition
- if we decompose \(R(A, B, C)\) into \(R_1(A,B)\) and \(R_2(A,C)\) due to \(A\rightarrow B\), then \(R=R_1\bowtie R_2\)
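A quick check of losslessness on a toy instance where \(A\rightarrow B\) holds (tuples and attribute names below are made up):

```python
# R(A, B, C) satisfying A -> B
R = [("a1", "b1", "c1"), ("a1", "b1", "c2"), ("a2", "b2", "c1")]

R1 = sorted({(a, b) for a, b, c in R})   # project on A, B
R2 = sorted({(a, c) for a, b, c in R})   # project on A, C

# natural join the fragments back on A
joined = sorted((a, b, c) for a, b in R1 for a2, c in R2 if a == a2)
print(joined == sorted(R))  # True: no spurious tuples, since A -> B holds
```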
Dependency-Preserving Decomposition¶
- a decomposition is dependency-preserving if after the decomposition all the FDs can still be found (directly checked or implied)
- BCNF may not preserve dependency
3NF 3rd Normal Form¶
- lossless decomposition (like BCNF)
- dependency-preserving decomposition (unlike BCNF)
- a relation \(R\) is in 3NF \(\iff\) for each nontrivial FD \(A\rightarrow B\) in \(R\), either:
- \(A\) is a superkey for \(R\) (same as the BCNF condition)
- \(B\) is a prime attribute (a member of a key) for \(R\)
- s.t. a key won't be split
Multivalued Dependencies¶
- dependencies other than FDs
- multivalued dependencies (MVD)
- inclusion dependencies (IND)
- join dependencies (JD)
- MVD Multivalued Dependencies \(A\twoheadrightarrow B\)
- like FD but \(B\) is a set of values instead of a unique value
- FD is a special case of MVD
- e.g. Rajesh's majors are {EE, Econ}
- use 4NF to reduce tables with MVDs
SQL Queries¶
SQL Basics¶
- created at IBM in 1970s
- standardized in 1986
- vendors support various subsets of it
- high-level declarative language
- let the system optimize itself
- basic form: SELECT ... FROM ... WHERE
- closure property: input: one or more relations; output: a relation
- -> can use it in query directly
- FROM
- cartesian product
- tuple variable: alias
- e.g. FROM Favorites F
- WHERE
- LIKE
- e.g. LIKE '%GREEN%'
- returns the records where the condition is TRUE (not FALSE or UNKNOWN)
- NULL
- missing value or inapplicable
- comparing it with a real value returns UNKNOWN
- 3-Valued Logic: TRUE/FALSE/UNKNOWN
- UNKNOWN AND TRUE = UNKNOWN
- UNKNOWN AND FALSE = FALSE
- UNKNOWN OR TRUE = TRUE
- UNKNOWN OR FALSE = UNKNOWN
- subquery: nested query, query inside a query
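The 3-valued-logic rules above can be seen directly in SQLite, on a toy Sells table:

```python
import sqlite3

# A comparison with NULL is UNKNOWN, so the row is filtered out
# (WHERE keeps only TRUE rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sells (bar TEXT, beer TEXT, price REAL)")
conn.executemany("INSERT INTO Sells VALUES (?, ?, ?)",
                 [("Joe's", "Bud", 3.0), ("Sue's", "Bud", None)])

eq  = conn.execute("SELECT bar FROM Sells WHERE price = 3.0").fetchall()
neq = conn.execute("SELECT bar FROM Sells WHERE price <> 3.0").fetchall()
nul = conn.execute("SELECT bar FROM Sells WHERE price IS NULL").fetchall()
print(eq)   # [("Joe's",)] -- the NULL row evaluates to UNKNOWN, not returned
print(neq)  # []           -- still UNKNOWN, also not returned
print(nul)  # [("Sue's",)] -- IS NULL is the way to test for NULL
```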
SQL Aggregation¶
- FROM-WHERE: generate tuples
- use attributes of tuples
- GROUP-BY: organize groups
- use attributes of tuples
- optional
- Functions: aggregate each group
- \(f(attributes)\), e.g. `SUM()`, `AVG()`, `COUNT()`, `MIN()`, `MAX()`
- \(f(tuples)\), e.g. `COUNT(*)`
- doesn't consider NULL values
- if only NULL values, output NULL (or 0 for COUNT)
- `DISTINCT` eliminates duplicates before aggregation
- e.g. `SELECT COUNT(DISTINCT price)`
- HAVING: filter groups
- use attributes of groups
- SELECT attributes to output
- use attributes of groups
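Putting the aggregation pieces together in SQLite, on a hypothetical Sells table:

```python
import sqlite3

# Count rows and distinct prices per bar, keeping only bars
# with more than one listing (HAVING filters groups).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sells (bar TEXT, beer TEXT, price REAL)")
conn.executemany("INSERT INTO Sells VALUES (?, ?, ?)", [
    ("Joe's", "Bud", 3.0), ("Joe's", "Miller", 3.0),
    ("Sue's", "Bud", 2.5),
])
rows = conn.execute("""
    SELECT bar, COUNT(*) AS beers, COUNT(DISTINCT price) AS prices
    FROM Sells
    GROUP BY bar
    HAVING COUNT(*) > 1
""").fetchall()
print(rows)  # [("Joe's", 2, 1)] -- Sue's group is filtered out by HAVING
```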
MongoDB Queries¶
lookup¶
like LEFT OUTER JOIN in SQL
Neo4j Cypher Queries¶
- basic queries
- MATCH
- like FROM in sql
- specify a pattern for graph traversal
- e.g. `MATCH (drinker:Drinker)-[drinks:Drinks]->(beer:Beer)`
- `drinker` & `beer` are the variables
- WHERE
- like WHERE in sql
- RETURN
- like SELECT in sql
- a table of an entity in sql -> a node
- a table of a relationship -> a relationship
- field of a table -> property of a node/relationship
- sql join -> traverse pattern
- aggregate
- no separate GROUP BY clause
- SELECT \(A_1, ..., A_n, F_1, ..., F_m\) GROUP BY \(A_1, ..., A_n\) in sql -> RETURN \(A_1, ..., A_n, F_1, ..., F_m\)
- or use WITH
- no separate HAVING clause
- collect & unwind
- `collect` aggregates values into a list; `UNWIND` does the opposite, expanding a list into rows
- chaining
- like in sql
Database Manipulation¶
- manipulation by purposes
- organizing
- create/delete database/schema/table
- modifying
- insert/delete/update tuples to a table
- augmenting
- create indexes and views
- regulating
- create constraints
- tools
- system
- console
- client
- language
Organizing¶
- Database -> (Schema ->) Table
- some Database -> Schema -> Table
- e.g. Oracle, PostgreSQL
- some Database -> Table
- e.g. MySQL
Augmenting - Views¶
View = a virtual table created using a query i.e. saved query result
It's just a virtual table, so you can't do updates/inserts directly on it, only via the underlying table(s). Therefore, for a view to be updatable, it needs to have sufficient attributes from the base table.
- join multiple tables -> not updateable
- too few attributes -> not updateable
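A minimal SQLite sketch: reads go through the view, but SQLite makes every view read-only (unless INSTEAD OF triggers are defined), so the insert is rejected:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sells (bar TEXT, beer TEXT, price REAL)")
conn.execute("INSERT INTO Sells VALUES ('Joe''s', 'Bud', 3.0)")
conn.execute("CREATE VIEW CheapSells AS SELECT bar, beer FROM Sells WHERE price < 5")

rows = conn.execute("SELECT * FROM CheapSells").fetchall()
print(rows)  # [("Joe's", 'Bud')] -- reading through the saved query works

err = None
try:
    conn.execute("INSERT INTO CheapSells VALUES ('Sue''s', 'Miller')")
except sqlite3.OperationalError as e:
    err = e       # cannot modify CheapSells because it is a view
print(err)
```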
Regulation¶
- Key Constraints
- primary key
- unique key
- foreign key
- can declare action
- e.g. on delete cascade, on update set null
- if no declaration then reject
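A sketch of a declared action in SQLite, which requires enabling foreign-key enforcement via a pragma (table names are the course's toy schema):

```python
import sqlite3

# ON DELETE CASCADE removes dependent rows when the referenced row disappears.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite FKs are off by default
conn.executescript("""
CREATE TABLE Beers (name TEXT PRIMARY KEY);
CREATE TABLE Sells (
    bar TEXT, beer TEXT REFERENCES Beers(name) ON DELETE CASCADE
);
INSERT INTO Beers VALUES ('Bud');
INSERT INTO Sells VALUES ('Joe''s', 'Bud');
""")
conn.execute("DELETE FROM Beers WHERE name = 'Bud'")
remaining = conn.execute("SELECT COUNT(*) FROM Sells").fetchone()[0]
print(remaining)  # 0 -- the Sells row was cascaded away
```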
- General Constraints - checks & assertions
- checks
- attribute-based
- tuple-based
- during insertion/update
- assertions
- database-based i.e. a check over the table
- during insertion/update
- not uniformly supported across RDBMSs
- MySQL doesn't support checks & assertions
- PostgreSQL supports checks without subqueries, but not assertions
- no major RDBMS supports assertions
- triggers
- limitations of checks & assertions
- timing: do checks on each insertion/update -> expensive
- flexibility: checks are limited to attributes or tuple-based constraints
- handling: can't control what to do when violated
- event-condition-action
- event: timing
- e.g. insert on Sells
- condition: flexibility
- any sql boolean-valued expression
- action: handling
- any sql statement
- widely implemented
- functionality & syntax may not be uniform
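A toy event-condition-action trigger in SQLite: the event is an insert on Sells, the condition a negative price, the action a fix-up update (the rule itself is made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sells (bar TEXT, beer TEXT, price REAL);
CREATE TRIGGER NoNegativePrice        -- event: insert on Sells
AFTER INSERT ON Sells
WHEN NEW.price < 0                    -- condition: any boolean expression
BEGIN                                 -- action: any SQL statement
    UPDATE Sells SET price = 0 WHERE rowid = NEW.rowid;
END;
""")
conn.execute("INSERT INTO Sells VALUES ('Joe''s', 'Bud', -1.0)")
prices = conn.execute("SELECT price FROM Sells").fetchall()
print(prices)  # the negative price was corrected to 0 by the trigger
```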
Accessing & Indexing Data¶
- main memory (volatile)
- query parser
- query optimizer
- query processor
- indexes
- disk
- buffer manager
- storage manager
- storage
indexing data¶
- provide a map for locating data
- s.t. don't need to scan the whole db for a small fraction of the data
- type
- dense index
- pointer to every key
- others: to every block etc.
accessing data¶
- data is big and cannot fit into main memory
ISAM¶
- Indexed Sequential Access Method
- kind of like a segment tree (but not really)
- each node has \(F\) values providing \(F+1\) segments/pointers for \(F+1\) children
- static tree structure: insertion/deletion only affect the leaf nodes, so the num of levels before a leaf node never changes
- search cost: num of hops to reach a leaf = tree height \(h=\log_F N\)
- num of random accesses / disk pointers
- \(F\) fanout = num of children for each node
- each leaf node/page only has \(F\) values, so when inserting a value within the range of a leaf node then we add an overflow page underneath (kind of like how hash tables work)
- cons
- can become uneven
- long & short paths in different parts of the tree
- making search time unpredictable
- nodes can become full or empty
- full -> insertion requires overflow
- empty -> waste disk storage space
B+ Tree¶
- B Tree
- self-balancing for indexing dynamic data
- uniform accesses in \(O(log(n))\)
- optimized for block-oriented access
- external memory
- https://kubokovac.eu/gnarley-trees/Btree.html
- B+ tree: small enhancement of B Tree
Structure¶
- \(d\) degree
- each node has \(n\) keys & \(n+1\) pointers
- max \(n=2d\)
- min \(n\approx d\) = 50% of the max
- leaf node \(n=\lceil d \rceil\)
- internal node \(n=\lfloor d \rfloor\)
- each node is a block
- max \(d=170\)
- assuming
- key size 4 bytes
- pointer size 8 bytes
- block size 4096 bytes
- \(4n+8(n+1)=12n+8\leq 4096\)
- \(n\leq 340\)
- max \(d\) = \(0.5n=170\)
- max fanout \(F=n+1=341\)
- typical \(d=100\)
- typical fill-factor \(67\%\)
- average fanout \(F=(2d+1)\times 67\%=133\)
- top few levels can be stored in memory
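The arithmetic above, checked in a few lines:

```python
# 4-byte keys, 8-byte pointers, 4096-byte blocks:
# n keys and n+1 pointers must fit in one block.
key, ptr, block = 4, 8, 4096
n = (block - ptr) // (key + ptr)   # solves 12n + 8 <= 4096
print(n, n // 2, n + 1)            # 340 keys max, d = 170, max fanout 341

# with typical average fanout 133, a 4-level tree already
# addresses 133^4 = about 313 million leaf entries
F = 133
print(F ** 4)
```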
Lookup¶
WHERE grade > 80
- locate key -> sequentially traverse through next-leaf pointers
Maintenance¶
- perform local changes when a node becomes too full or empty
- involves only siblings
- solutions not unique
- make sure
- node capacity in 50% ~ 100% (except root)
- every node has \(n+1\) pointers if has \(n\) keys
- each key = min of key of the right-pointer subtree
- each key > any key in the left-pointer subtree
- may trigger more local changes recursively upstream i.e. propagate up
Insertion¶
example
- insert 98, but leaf full -> split
- insert 75, but leaf full, and parent pointers also full -> split leaf & parent
- insert 72
rules
- insert key to leaf node \(N\) with pointer \(P\)
- if node becomes too full, split node \(N\) into node \(N\) & \(N'\) with pointer \(P'\)
- insert \(P'\) to parent recursively, or redistribute pointers to neighbors
- adjust keys s.t. each key = min of keys of its right-pointer subtree
- the tree adds one level if the root splits
- new root will have only 1 key (root can have less than \(d\) keys)
Deletion¶
rules
- delete key
- if node becomes too empty
- redistribute pointers from siblings
- merge with sibling
- delete pointer from parent recursively
- adjust keys s.t. each key = min of keys of its right-pointer subtree
- the tree loses one level if the root's only key was deleted
example
- delete 90 -> redistribute
- delete 92 -> get a pointer from left sibling to balance and update root values
- delete 65
Static Hash Table¶
- hash function maps a key to a k-bit code -> \(2^k\) buckets
- req: uniform distribution of keys
- bucket \(h(K)\) stores pointers for records of key \(K\)
- in external storage
- a bucket = a block
- use overflow blocks when needed
- see Data Structure#hash
- good for static data
- arrange hash functions & bucket sizes s.t. no overflow
- very good lookup performance
- bad for dynamic data
- may over allocate for growth leading to low block utilization
- lookup performance degrades heavily when too many overflow blocks
Dynamic Hash Table¶
- allocate codes to buckets dynamically, depending on how many buckets needed
- allocate a lot of bits but only use a number of bits, depending on the need
- bucket number changes dynamically
extendible hash table¶
- order bucket by \(i\) MSB (most significant \(i\) bits)
- the most significant \(i\) bits across all the codes in each bucket are the same
- num of buckets \(n=2^i\)
- can merge buckets sharing MSB i.e. neighboring buckets
- store \(i\) value in the nub for each bucket
- directory: stores pointers for each bucket
- insertion
- if a bucket using \(j<i\) bits is too full, split into 2 buckets using \(j+1\) bits and distribute the keys
- if a bucket using \(i\) bits is too full, split into 2 buckets using \(i+1\) bits and distribute the keys
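A toy sketch of the MSB-directory scheme above (our own simplification: 4-bit codes, bucket capacity 2, and repeated overflow after a split — all keys landing in one half — is not handled):

```python
BITS, CAPACITY = 4, 2

def h(key):
    return key & 0b1111            # toy 4-bit hash code (low bits of the key)

def msb(code, i):
    return code >> (BITS - i) if i else 0   # top i bits of the code

class Bucket:
    def __init__(self, j):
        self.j = j                 # the "nub": bits this bucket distinguishes on
        self.keys = []

class ExtHash:
    def __init__(self):
        self.i = 0
        self.dir = [Bucket(0)]     # directory of 2^i bucket pointers

    def lookup(self, key):
        return key in self.dir[msb(h(key), self.i)].keys

    def insert(self, key):
        b = self.dir[msb(h(key), self.i)]
        if len(b.keys) < CAPACITY:
            b.keys.append(key)
            return
        if b.j == self.i:          # no spare bit to split on: double the directory
            self.dir = [self.dir[k >> 1] for k in range(2 ** (self.i + 1))]
            self.i += 1
        # split the full bucket on one more bit and redistribute its keys
        b0, b1 = Bucket(b.j + 1), Bucket(b.j + 1)
        for k in b.keys + [key]:
            (b1 if msb(h(k), b.j + 1) & 1 else b0).keys.append(k)
        bit = 1 << (self.i - b.j - 1)   # directory-index bit for the new prefix bit
        for idx in range(2 ** self.i):
            if self.dir[idx] is b:
                self.dir[idx] = b1 if idx & bit else b0

t = ExtHash()
for k in [0b0000, 0b1000, 0b0100]:   # third insert forces a directory doubling
    t.insert(k)
print(t.i)                            # 1
print(t.lookup(0b1000), t.lookup(0b0010))
```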
linear hash table¶
- order bucket by \(i\) LSB (least significant \(i\) bits)
- the least significant \(i\) bits across all the codes in each bucket are the same
- can merge buckets sharing LSB i.e. buckets \(2^{i-1}\) buckets apart
- no directory
- has overflow blocks
- maintains a max avg bucket capacity \(u\)
- \(u\) = num of keys / num of buckets
- lookup
- get hash code -> find \(i\) LSB of the hash codes \(m\) ->
- if \(m\leq n\) then find the bucket numbered \(m\)
- if \(m>n\) then flip the highest bit of \(m\) and go to the corresponding bucket
- get hash code -> find \(i\) LSB of the hash codes \(m\) ->
- insertion
- use overflow blocks if necessary
- if hash table too full i.e. capacity over defined max \(u\), add a new block \(n+1\) and redistribute
Different Indexes¶
- clustered vs. unclustered
- clustered index
- basically sort
- data is sorted according to the index
- only one per table
- unclustered index
- stored separately
- can have multiple on a table
- dense vs. sparse
- dense index
- each record is in the index
- unclustered index is always dense
- because data isn't sorted according to the index, it must point to every record to find them
- sparse index
- only some records are in the index
- primary vs. secondary
- primary index
- index on primary key
- secondary index
- all others