Datenbankenlernen.de

Database Systems



0.1 Course Overview and Motivation

0.1.1 The Truth about Databases

Why are database systems a vertical topic? What are those vertical topics? Is this lecture only about database systems, or about software in general?

Additional Material
Further Reading
0.1.2 Architecture of a DBMS

What are the different software layers of a database system? How do they relate to the different vertical topics? What are the major tasks of store, indexer, and query optimizer? What are system aspects? What is the conflict of computation versus data access about? And how does this conflict relate to the different layers of a database system?

0.1.3 Structure of this Course

How is the material covering the different database layers mapped to the weeks of the semester? What exactly does the recursive/iterative structure mean?


0.2 History of Relational Databases

0.2.1 A Footnote about the Young History of Database Systems

Why would it make sense to use a database system anyway? How did it all start? What was a major change introduced by the relational model? What were major developments in database history?

0.2.2 Relational Database --- A Practical Foundation of Productivity

What was the problem with data management in the 1960s? What is associative addressing in the context of a database system? What is a data model? In the relational model, does the order of the columns in a schema matter? In the relational model, does the order of the rows in a relation matter? What is the primary goal of relational processing? Is relational algebra intended to be used as a language for end-users?

Material
Literature

1.1 Storage Hierarchies

1.1.1 A Simple Storage Hierarchy

What are the properties of ideal computer memory? What is the core idea of the storage hierarchy? What is the relationship of storage capacity and access time to the distance to the computing core? What is the relationship of costs/bandwidth to the distance to the computing core? What are typical access times? How does this translate to relative distances? What are typical sizes of the different storage levels? How do they translate to relative distances? What are the tasks of each level of the storage hierarchy? How would I use a storage hierarchy to cache only reads? How do I use a storage hierarchy to cache both reads and writes? What is inclusion? What is a data replacement strategy?

Additional Material
Literature
Further Reading
1.1.2 The All Levels are Equal Pattern

What is the central observation here? What does this mean for practical algorithms? How can I exploit this to solve a problem on a specific layer of the storage hierarchy? How to misunderstand this pattern?

1.1.3 Multicore Storage Hierarchies, NUMA

How are the two terms core and CPU related? What does a typical multicore storage hierarchy look like? What is different compared to the simple storage hierarchy? Why isn't L1 shared as well among multiple cores? What is a Non-Uniform Memory Access (NUMA) architecture? What is different compared to the multicore architecture? What does this imply for accesses to DRAM? Bonus Question: How again does this relate to The All Levels Are Equal?

Additional Material
Further Reading

1.2 Storage Media

1.2.1 Tape

What are the major properties of tape? Why would tape still be important these days? What is a tape jukebox? How does it work? What does this imply for access times?

Additional Material
Literature
1.2.2 Hard Disks, Sectors, Zone Bit Recording, Sectors vs Blocks, CHS, LBA, Sparing

Why would hard disks still be important these days? What is a platter? What are a disk head and a disk arm? How do tracks and cylinders relate? What is the difference between a circular sector and an HD sector? What is zone bit recording? What does this imply for sequential access? Where are self-correcting blocks used? What is the difference between HD sectors and operating system blocks? What is physical cylinder/head/sector-addressing? What is logical cylinder/head/sector-addressing? How does it relate to the former? What is logical block addressing? What is sparing? What does this imply for sequential accesses?

Additional Material
Literature
Further Reading
Video
1.2.3 Hard Disks, Sequential Versus Random Access

What exactly is a random access? What are its major components? What is a sequential access? How to estimate the costs of a sequential access? What is track skewing? How did hard disks evolve over the last decades? What do we learn from that? Bonus Question: What does this imply for index structures?

1.2.4 Hard Disks, Controller Caching

What exactly is the hard disk cache? How is it related to the storage hierarchy? What do we gain by using this cache? What do we lose? What does this imply when storing data on disk? What is the elevator optimization?

1.2.5 The Batch Pattern

What is the central observation of the batch pattern? What are the advantages? What are the disadvantages? What does this mean for practical algorithms? What are possible applications?

1.2.6 Hard Disk Failures and RAID 0, 1, 4, 5, 6

Why should I worry about hard disk failures? What is the core idea of RAID? What is the impact on performance of RAID 0? Is my data safer by using RAID 0? What is RAID 1? How much space does it waste? What is the difference between RAID 4 and RAID 5? How many disks do I need for a RAID 4 or 5 system? Which sequential read performance can I expect in a system with n disks? And which write performance? How many disk failures may a RAID 4 or 5 system survive? What is the difference of RAID 6 compared to RAID 5?
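The parity idea behind RAID 4/5 can be made concrete in a few lines: the parity block is the bit-wise XOR of the data blocks, so any single lost block is recovered by XOR-ing the surviving blocks with the parity block. A minimal Python sketch (the function name is mine, not from the lecture material):

```python
def xor_blocks(blocks):
    # bit-wise XOR of equally sized blocks (the RAID 4/5 parity operation)
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

data = [b"AAAA", b"BBBB", b"CCCC"]   # blocks on three data disks
parity = xor_blocks(data)            # stored on the parity disk

# disk 1 fails: XOR the surviving blocks with the parity block
recovered = xor_blocks([data[0], data[2], parity])
```

Since XOR-ing a block with itself cancels out, `recovered` equals the lost block; this also shows why a RAID 4/5 array survives exactly one disk failure.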

Additional Material
Literature
Further Reading
1.2.7 Nested RAID Levels 1+0, 10, 0+1, 01

What is RAID 10? Why would we call this nested RAID? And what is RAID 01 then? Is it possible, in principle, to combine any kinds of RAID-levels?

1.2.8 The Data Redundancy Pattern

Why is data redundancy good? And for what exactly? Why is data redundancy bad? And for what exactly? Can you list three examples where data redundancy is used for good?

Additional Material
Literature
Further Reading
1.2.9 Flash Memory and Solid State Drives (SSDs)

What are the major properties of flash devices? What is an SSD? What is the major differentiator over hard disks w.r.t. performance? In an SSD, what are blocks and superblocks? What does this mean for write operations? What is write amplification? How would this affect the storage layer of a database system? Why is TRIM needed? What is the job of the SSD controller? How does it relate to RAID? How does it relate to volatile memory? Which sequential bandwidth can we expect these days? And which random access time?

1.2.10 Example Hard Disks, SSDs and PCI-connected Flash Memory

Again, what are typical performance characteristics of hard disks and SSDs? Would you buy an SAS disk for your PC? What is a PCI flash drive? Why would that be a good idea? Why do I get considerably more random read operations per second than one divided by the random access time?


1.3 Fundamentals of Reading and Writing in a Storage Hierarchy

1.3.1 Pulling Up and Pushing Down Data, Database Buffer, Blocks, Spatial vs Temporal Locality

What does pulling up data mean? What does pushing down data mean? Did you see the database buffer anywhere? What is temporal locality? And what is spatial locality then? How are the two related?

Material
Literature
Additional Material
Literature
Further Reading
1.3.2 Methods of the Database Buffer, Costs, Implementation of GET

What are the most important methods of the database buffer? What does GET do? What may happen when requesting a page that is not in the buffer? What are the costs involved? Who would evict a page?
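The behavior of GET can be sketched with a tiny buffer in Python. This is an illustrative model only (assumptions: pages are plain values, the "disk" is a dict, LRU is the replacement strategy; all class and method names are mine):

```python
from collections import OrderedDict

class Buffer:
    """Tiny database buffer: GET pulls a page up, evicting LRU on overflow."""
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk                    # page_id -> content (simulated store)
        self.frames = OrderedDict()         # buffered pages, LRU order
        self.misses = 0

    def get(self, page_id):
        if page_id in self.frames:          # hit: refresh the LRU position
            self.frames.move_to_end(page_id)
            return self.frames[page_id]
        self.misses += 1                    # miss: page must be fetched
        if len(self.frames) >= self.capacity:
            self.frames.popitem(last=False) # evict the least recently used page
        self.frames[page_id] = self.disk[page_id]
        return self.frames[page_id]

disk = {i: f"page-{i}" for i in range(10)}
buf = Buffer(2, disk)
buf.get(0); buf.get(1); buf.get(0); buf.get(2)   # the last GET evicts page 1
```

Each miss corresponds to an expensive pull-up from the lower level of the storage hierarchy; in a real system the evicted page may additionally have to be written back if it is dirty.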

1.3.3 Pushing Down Data in the Storage Hierarchy (aka Writing), update in-place, deferred update

What is direct write? What is indirect write? What are their pros and cons?

1.3.4 Twin Block, Fragmentation

What is the core idea of twin block? What are its pros and cons?

1.3.5 Shadow Storage

What is the core idea of shadow storage? What are its pros and cons? What about fragmentation in shadow storage? Where is this used outside databases?

1.3.6 The Copy On Write Pattern (COW)

When is this pattern applicable? What is the core idea? Where is it applied?

1.3.7 The Merge on Write Pattern

When is this pattern applicable? What is the core idea? Where is it applied? What is the relationship to COW?

1.3.8 Differential Files, Merging Differential Files

What is the main analogy from real life for differential files? To which other patterns does this relate? What are its pros and cons? How to merge differential files with the read-only DB without halting the database?

1.3.9 Logged Writes, Differential Files vs Logging

What is the difference of logging and differential files? What are its pros and cons? Is it possible to combine logging and differential files?

1.3.10 The No Bits Left Behind Pattern

Why shouldn't we leave any bits behind? Or in other words: why should we avoid wasting bits? What does this mean for data layouts? Why would I cluster data with spatial locality on the same virtual memory page, disk page, disk sector, cache line? Why would I keep data with little spatial locality on different virtual memory pages, disk pages, disk sectors, cache lines?


1.4 Virtual Memory

1.4.1 Virtual Memory Management, Page Table, Prefix Addressing

How are virtual memory addresses translated to physical addresses? How does it work? What is the role of the page table? What is prefix addressing?

Additional Material
Literature
1.4.2 Retrieving Memory Addresses, TLB

What happens when referencing a specific virtual memory address? What kind of translation lookaside buffer (TLB) and cache misses may occur?

Additional Material
Literature

2.1 Overview


What are the principal mapping steps from relations to devices? Which of those steps are related to data layouts? Why linearize tuples? Why linearize values? And which comes first?


2.2 Page Organizations

2.2.1 Slotted Pages: Basics

What is the core idea of slotted pages? What is a slot? What is the relationship to physical data independence? Can I move data inside a slotted page? Like how? What is a forward tuple-ID? What are they good for? What might be a problem?

Additional Material
Literature
2.2.2 Slotted Pages: Fixed-size versus Variable-size Components

How to organize fixed-size components on a slotted page? Where to store the slot array then? What is linear addressing? What can we do with variable-sized components? Can we move them around on a page?

Additional Material
Literature
2.2.3 Finding Free Space

How would we find some free space for new stuff? What is a segment?

Additional Material
Literature

2.3 Table Layouts

2.3.1 Data Layouts: Row Layout vs Column Layout

Why would we linearize data values in row layout? Why would we linearize data values in column layout? Which type of layout is better for which type of query? What is a row store? What is a column store?

2.3.2 Options for Column Layouts, Explicit vs Implicit key, Tuple Reconstruction Joins

What are correlated columns? Could they also be uncorrelated? And what would that imply? What is an implicit key? What does this mean for tuple reconstruction joins? What is an explicit key? What does this mean for tuple reconstruction joins? How do I get from explicit key to implicit key?

2.3.3 Fractured Mirrors, (Redundant) Column Grouping, Vertical Partitioning, Bell Numbers

What is the relationship of fractured mirrors to row and column layouts? Why would this make sense? This is good for which type of queries again? What are the drawbacks? What is column grouping and its relationship to vertical partitioning? Which type of queries would benefit from this? How would we introduce data redundancy to column grouping? How many vertical partitionings are there? What has this to do with The Data Redundancy Pattern?

Additional Material
Literature
2.3.4 PAX, How to choose the optimal layout?

What is PAX about? How does this relate to horizontal partitioning? How does the horizontal partitioning relate to blocks? How does PAX relate to column layouts? What would this be good for? And how do I get to the optimal data layout anyway?

Additional Material
Literature
2.3.5 The Fractal Design Pattern

What is fractal (self-similar) design about in the context of database systems? Why would this help me to devise effective solutions to problems? How does this relate to The All Levels are Equal Pattern? Can you name a couple of examples of fractal design in databases that we have already seen?


2.4 Compression

2.4.1 Benefits of Compression in a Database, Lightweight Compression, Compression Granularities

Compression is mainly about saving storage space, right? But wait: wasn't there something more important? What is the major trade-off you have to keep in mind here? Compressing data costs something in addition! You can only lose w.r.t. overall query response times, right? What are compression granularities? How do they affect accessibility and compression ratio of your data (in general)?

Additional Material
Literature
Further Reading
2.4.2 Dictionary Compression, Domain Encoding

What is a dictionary? What do we gain by using one? How does this relate to CREATE DOMAIN? How would dictionaries affect query processing? Is a dictionary something a user has to be aware of? How does dictionary compression relate to domain encoding? What are the pros and cons of dictionary compression?
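The core of dictionary compression fits in a few lines: map each distinct value to a small integer code and store only the codes. A minimal Python sketch (names are mine; real systems use fixed-width bit-packed codes):

```python
def dict_encode(column):
    # build a sorted dictionary: value -> small integer code
    dictionary = {v: code for code, v in enumerate(sorted(set(column)))}
    return dictionary, [dictionary[v] for v in column]

col = ["Paris", "Rome", "Paris", "Paris", "Rome"]
d, codes = dict_encode(col)

# decoding inverts the dictionary; queries can often work on codes directly
decode = {code: v for v, code in d.items()}
restored = [decode[c] for c in codes]
```

Because the dictionary is sorted, order-preserving comparisons (e.g. range predicates) can be evaluated on the integer codes without decompressing.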

2.4.3 Run Length Encoding (RLE)

How does the run-length encoding method work? What is its relationship to sorting? To lexicographical sorting? What might be possible pros and cons?
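As a concrete illustration, a minimal run-length encoder/decoder in Python (function names are mine). Note how a sorted column collapses into very few runs, which is exactly the connection to sorting asked about above:

```python
def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return runs

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

col = ["a", "a", "a", "b", "b", "a"]
runs = rle_encode(col)
```

Unsorted data with few repeats can even grow under RLE (one run per value), which is one of the cons to keep in mind.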

2.4.4 7Bit Encoding

What is the major idea of this method? How would this relate to domain encoding? How much storage space do I lose? How much do I gain?
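The idea of 7-bit encoding can be sketched as a variable-length integer encoding: each byte stores 7 payload bits, and the high bit signals whether more bytes follow. A minimal Python sketch (names are mine; this is the same scheme known elsewhere as varint encoding):

```python
def varint_encode(n):
    """Encode a non-negative int: 7 payload bits per byte, high bit = 'more'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # continuation bit set: more bytes follow
        else:
            out.append(byte)          # last byte: continuation bit cleared
            return bytes(out)

def varint_decode(data):
    n, shift = 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return n
```

We lose one bit per byte (the continuation flag), but small values that dominate many domains shrink from 4 or 8 bytes to 1 or 2.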

Additional Material
Further Reading

3.1 Motivation for Index Structures, Selectivities, Scan vs. Index Access on Disk and Main Memory


What are the major analogies of indexing in real life? Why index? What is selectivity? What is high selectivity? What is low selectivity?

Additional Material
Literature
Further Reading

3.2 B-Trees

3.2.1 Three Reasons for Using B-Tree Indexes, Intuition, Properties, find(), ISAM, find_range()

The three reasons for using B-tree indexes are ... ? What is the intuition for B-tree indexes? What is a node? What is a leaf? How many keys do they have? What happens if they are the root? What is k? What is k*? What are the major properties of a B-tree? What is the relationship of B-trees to interval partitioning? How do we search for a particular key in a B-tree, i.e. how do we execute a point query? How do we search for a particular interval in a B-tree, i.e. how do we execute a range query? What is the index sequential access method (ISAM)? How does it relate to point and range queries?

Additional Material
Literature
3.2.2 B-Tree insert, split, delete, merge

How do inserts in real B-trees work? Why would I split a leaf or a node? And how does this split work in principle? How do I implement the split operation in an object-oriented programming language? What does this mean for leaf and node splits? How do I delete data from a B-tree and its leaves and nodes? What may happen then? How are delete() and insert() related? How are merge() and split() related?

Additional Material
Literature
3.2.3 Clustered, Unclustered, Dense, Sparse, Coarse-Granular Index

What is a clustered index? How many clustered indexes are possible for a table? What is an unclustered index? How many unclustered indexes are possible for a table? What is a dense index? What is a sparse index? And how are they related? How is a coarse-granular index related to a sparse index? How is a sparse index related to having no index at all?

Additional Material
Literature
3.2.4 Covering and Composite Index, Duplicates, Overflow Pages, Composite Keys

How is a covering index related to an unclustered index? How is it related to a clustered index? What exactly is covered? What is a composite index? How does it relate to a covering index? How do we build a duplicate index in a B-tree? What is an overflow page? How are composite keys related to duplicates?

Additional Material
Literature
3.2.5 Bulk-loading B-Trees or other Tree-structured Indexes

How do I bulkload a B-tree? And why would I want to do this? What should be kept in mind for the free space in nodes and leaves when bulkloading?

Additional Material
Literature

3.3 Performance Measurements in Computer Science


How to determine which algorithm, index, or whatever is better? O-Notation is good enough, right? What are the three ways to measure the performance of a computer program? What is asymptotic complexity? What is a cost model? What is a simulation? What might be the problem of all of the former methods? What is an experiment? What could be its problem? How are the three ways to measure the performance of a computer program correlated to effort/cost, reality, and generalizability?

Additional Material
Literature
Further Reading

3.4 Static Hashing, Array vs Hash, Collisions, Overflow Chains, Rehash


Why do some people say that the three most important techniques in computer science are hashing, hashing, and hashing? What is the core idea of hashing? What is the runtime complexity? How does it relate to arrays? What is a collision? How can I handle it? Why would I rehash? And what does that mean?
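The core mechanics (hash to a bucket, chain collisions, rehash when chains get long) can be sketched in Python. This is an illustrative model only; class name and the rehash threshold are mine:

```python
class HashTable:
    """Static hashing with overflow chains; rehash doubles the bucket array."""
    def __init__(self, num_buckets=4):
        self.buckets = [[] for _ in range(num_buckets)]
        self.size = 0

    def put(self, key, value):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for entry in chain:
            if entry[0] == key:             # key exists: update in place
                entry[1] = value
                return
        chain.append([key, value])          # collision handling: overflow chain
        self.size += 1
        if self.size > 2 * len(self.buckets):
            self._rehash()                  # chains grow too long: rehash

    def get(self, key):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        return None

    def _rehash(self):
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        for chain in old:
            for k, v in chain:              # every entry gets a new bucket
                self.buckets[hash(k) % len(self.buckets)].append([k, v])

t = HashTable()
for i in range(20):
    t.put(i, i * i)
```

Lookups stay O(1) expected as long as rehashing keeps the chains short; the rehash itself is the O(n) price paid occasionally for that.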

Additional Material
Literature
Further Reading

3.5 Bitmaps

3.5.1 Value Bitmaps

What is the core idea of a bitmap? What is a bitlist? What is the size of an uncompressed bitmap? What are typical bitmap operations and why are they very efficient? How is the cardinality of a domain related to the size of the bitmap? What are applications of bitmaps?
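As a concrete illustration, a value bitmap index in Python: one bitlist per distinct value of the domain, and predicates become bit-wise operations (names are mine; real systems operate on machine words, not Python lists):

```python
def build_bitmaps(column):
    # one bitlist per distinct value of the domain
    bitmaps = {v: [0] * len(column) for v in set(column)}
    for i, v in enumerate(column):
        bitmaps[v][i] = 1
    return bitmaps

col = ["red", "blue", "red", "green", "blue"]
bm = build_bitmaps(col)

# the predicate "color = red OR color = blue" is a bit-wise OR of two bitlists
red_or_blue = [a | b for a, b in zip(bm["red"], bm["blue"])]
```

The example also shows why domain cardinality matters: every additional distinct value costs one more bitlist of full column length (before compression).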

Additional Material
Literature
Further Reading
3.5.2 Decomposed Bitmaps

When does a decomposed bitmap make sense? What is the core idea of a decomposed bitmap? What does this imply for bitmap operations? What do we gain in terms of storage space?

3.5.3 Word-Aligned Hybrid Bitmaps (WAH)

What is the relationship of word-aligned hybrid bitmaps (WAH) to 7-bit encoding and RLE? How does WAH work? How can we use this to compress a bitlist? How do we decompress?

Additional Material
Further Reading
Java Library
3.5.4 Range-Encoded Bitmaps

What is a range-encoded bitmap? What changes over the standard uncompressed bitmap? Which type of queries can be efficiently supported by range-encoded bitmaps? Can we also execute a point query? How exactly? What are the space requirements when compared to the standard uncompressed bitmap?

3.5.5 Approximate Bitmaps, Bloom Filters

What is the core idea of a Bloom filter? When can it be applied? What is a false positive? What are the lookup semantics of a Bloom filter?
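The lookup semantics ("definitely not contained" vs "maybe contained") can be sketched in a small Python Bloom filter. Names and parameters are mine; SHA-256 is used only to derive k deterministic hash functions for the sketch:

```python
import hashlib

class BloomFilter:
    """Approximate membership test: a negative answer is always correct,
    a positive answer may be a false positive, never a false negative."""
    def __init__(self, m=64, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # k deterministic hash functions derived from SHA-256
        return [int.from_bytes(
                    hashlib.sha256(f"{i}:{item}".encode()).digest()[:4],
                    "big") % self.m
                for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p          # set k bits for the item

    def might_contain(self, item):
        # contained only if ALL k bits are set; any zero bit proves absence
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = BloomFilter()
for word in ["tape", "disk", "ssd"]:
    bf.add(word)
```

Inserted items always answer "maybe"; the false-positive rate is tuned via the bitmap size m and the number of hash functions k.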

Material
Literature
Additional Material
Literature

4.1 Join Algorithms

4.1.1 Applications of Join Algorithms, Nested-Loop Join, Index Nested-Loop Join

Why are joins important? What are typical applications in data management? What is the possible impact of joins on query performance? What are the four principal classes of join algorithms? What is nested-loop join (NL)? What is its runtime complexity? For which type of join predicates does it work? What is index nested-loop join (INL)? What is required to run this algorithm? For which type of join predicate does it work? What is the runtime complexity of this algorithm?
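The contrast between the two algorithms can be sketched in a few lines of Python (function names are mine): nested-loop join works for arbitrary predicates at quadratic cost, while index nested-loop join exploits an index, here a transient hash index on S, but only for equality predicates:

```python
def nested_loop_join(R, S, theta):
    # O(|R| * |S|): works for arbitrary join predicates theta
    return [(r, s) for r in R for s in S if theta(r, s)]

def index_nested_loop_join(R, S, key_r, key_s):
    # build a hash index on S, then probe it for every tuple of R;
    # near-linear cost, but restricted to equality predicates
    index = {}
    for s in S:
        index.setdefault(key_s(s), []).append(s)
    return [(r, s) for r in R for s in index.get(key_r(r), [])]

R = [(1, "Alice"), (2, "Bob")]
S = [(1, "Math"), (1, "DB"), (3, "AI")]
out = index_nested_loop_join(R, S, lambda r: r[0], lambda s: s[0])
```

In a real system the index would typically be a persistent B-tree or hash index on the inner relation rather than one built on the fly.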

Additional Material
Literature
4.1.2 Simple Hash Join

How does simple hash join (SHJ) work? What is the relationship to index nested-loop join?

Additional Material
Literature
4.1.3 Sort-Merge Join, Co-Grouping

How does sort-merge join (SMJ) work? What does the algorithm do in principle? What has to be considered when deciding which pointer to move forward? How would I treat a situation where duplicates exist in both join columns? What is the relationship of joins and CoGroups? What is CoGrouping?
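The pointer-movement logic, including the duplicate case asked about above, can be sketched in Python (names are mine). When both sides carry the same key, the algorithm collects the complete co-group on each side and emits their cross product:

```python
def sort_merge_join(R, S, kr, ks):
    R, S = sorted(R, key=kr), sorted(S, key=ks)
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if kr(R[i]) < ks(S[j]):
            i += 1                      # advance the pointer on the smaller key
        elif kr(R[i]) > ks(S[j]):
            j += 1
        else:
            # equal keys: collect the co-group on both sides so that
            # duplicates in both join columns are handled correctly
            key, i2, j2 = kr(R[i]), i, j
            while i2 < len(R) and kr(R[i2]) == key:
                i2 += 1
            while j2 < len(S) and ks(S[j2]) == key:
                j2 += 1
            out.extend((r, s) for r in R[i:i2] for s in S[j:j2])
            i, j = i2, j2
    return out

R = [(1, "a"), (2, "b"), (2, "c")]
S = [(2, "x"), (2, "y"), (3, "z")]
out = sort_merge_join(R, S, lambda r: r[0], lambda s: s[0])
```

The two inner while-loops are exactly the co-grouping step: each matching key yields one co-group pair before the join of that pair is produced.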

Additional Material
Literature
4.1.4 Generalized Co-Grouped Join (on Disk, NUMA, and Distributed Systems)

Why are CoGroups a general property of joins rather than a property of a specific join algorithm? What is the core idea of generalized co-grouped join? Which three special cases can be implemented with this algorithm? What does the group()-function do in the algorithm? Which phases can be identified in the algorithm? Which algorithm executes the actual join then? Could that be a cross product? How can Generalized Co-Grouped Join be optimized to avoid calling that algorithm in certain cases anyhow?

4.1.5 Double-Pipelined Hash Join, Relationship to Index Nested-Loop Join

When does it make sense to consider using double-pipelined hash join? How does the algorithm work? In what sense is this algorithm symmetric? What is double-pipelined index join? What is the relationship to Index Nested-Loop Join? Does Double-Pipelined Hash Join have a Build and a Probe Phase? What is the difference of DPHJ and INLJ w.r.t. the join results produced over time?


4.2 Implementing Grouping and Aggregation


Would it be fair to say that grouping and aggregation is just like co-grouping, however using a single subset for each co-group? Which phases can be identified in hash-based grouping? What exactly is kept in the hashmap? Which phases can be identified in sort-based grouping? What is "group closing"? How to handle the last group?

Additional Material
Literature

4.3 External Sorting

4.3.1 External Merge Sort

When does it make sense to run external merge sort? What is run generation? How does it work? What is the result of it? What happens during the merge phase, i.e. pass > 0? What happens during a single merge? What is the fan-in F? How could it be defined? Why would it be a bad idea to implement merge sort with a fan-in of F=2? Why would it also be a bad idea to implement merge sort with a fan-in F=m-1 (where m is the number of pages available in main memory)? How are the different runs organized in main memory? What exactly is stored for each run? Which runs are typically merged first? You did not forget that this specific variant is a very simple version of the algorithm, right?
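The two phases, run generation (pass 0) and repeated fan-in-way merging, can be sketched in memory with Python lists standing in for disk pages (names and page-counting are mine; this is exactly the "very simple version" of the algorithm, with quicksort-style run generation):

```python
import heapq

def external_merge_sort(data, memory_pages=3, page_size=2):
    m = memory_pages * page_size            # tuples fitting into main memory
    # pass 0 (run generation): sort memory-sized chunks into initial runs
    runs = [sorted(data[i:i + m]) for i in range(0, len(data), m)]
    fan_in = memory_pages - 1               # one page is reserved for output
    # merge passes: merge up to fan_in runs at a time until one run is left
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0]

result = external_merge_sort([5, 3, 8, 1, 9, 2, 7, 4, 6, 0])
```

The fan-in F = m - 1 here is the upper extreme asked about above; a real system picks F between 2 and m - 1 to balance the number of passes against per-run buffer sizes.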

Additional Material
Literature
Video
4.3.2 Replacement Selection

Which overall impact does Replacement Selection typically have when used with External Merge Sort? What is the typical size of a run generated by Replacement Selection compared to Quicksort? What is the purpose of the heap and the list? How are they implemented physically? Which condition must hold at all times for the list and the heap? How does the algorithm start? What is the central idea then? When is an element inserted into the heap? When is it inserted into the list? When does a run end? What can be said about the cache-efficiency of this algorithm?
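The heap-and-list interplay can be sketched in Python (names are mine; "pending" is a plain Python list standing in for the list structure, and the input is a list standing in for the input stream). An element still fitting the current run goes into the heap; a smaller element must wait in the list for the next run:

```python
import heapq

def replacement_selection(stream, memory=3):
    heap = stream[:memory]                  # fill memory with first elements
    heapq.heapify(heap)
    runs, run, pending, pos = [], [], [], memory
    while heap:
        smallest = heapq.heappop(heap)
        run.append(smallest)                # move the minimum to current run
        if pos < len(stream):
            nxt = stream[pos]
            pos += 1
            if nxt >= smallest:
                heapq.heappush(heap, nxt)   # still fits the current run
            else:
                pending.append(nxt)         # has to wait for the next run
        if not heap:                        # current run is complete
            runs.append(run)
            run, heap, pending = [], pending, []
            heapq.heapify(heap)
    return runs

runs = replacement_selection([5, 1, 6, 2, 7, 3, 8, 4], memory=3)
```

In the example the first run holds 6 elements with a memory of only 3, illustrating the expected roughly-double run length compared to quicksort-based run generation.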

4.3.3 Late, Online, and Early Grouping with Aggregation

What is late aggregation? What is the amount of I/O you can expect then? What is online grouping? Can this only be used with grouping? What do you gain in terms of I/O? What is early aggregation? What do you gain in terms of I/O? What are drawbacks of both Online and Early Grouping? What has a huge impact on the actual savings for Early Grouping?


5.1 Overview and Challenges

5.1.1 Query Optimizer Overview

What is the task of the query parser? What is the high-level task of the query optimizer? What is the relationship of the WHAT and the HOW aspects in that process? Why would the query optimizer need statistics about the data? Where does it obtain those statistics? What is query plan interpretation? What is the relationship of query optimization and programming language compilation?

Additional Material
Literature
Video
5.1.2 Challenges in Query Optimization: Rule-Based Optimization

What is the canonical form of a query? When would that be a DAG (directed acyclic graph)? What does rule-based optimization do in principle? What does a single rule do? Why would I break up conjuncts of selections? Why does it make sense to push down selections? When exactly is that possible? Why should I strive to introduce joins whenever possible? And when is that possible anyway? When is it possible to push down projections? What has to be considered when doing this? What does the data layout of the store have to do with pushing down projections? So what are the most important rules?

5.1.3 Challenges in Query Optimization: Join Order, Costs, and Index Access

What is join order? What may be the effect of picking the wrong join order? Does this problem only occur for joins? How do I pick the right access plan? (OK, not at all, the query optimizer does this...) How does the query optimizer decide which access plan to take? What are the possibly different costs of picking a scan, clustered index, unclustered index or covering index? How could I estimate those costs in a disk-based system? Why should I be very careful with unclustered indexes? What may happen if the selectivity estimates of a query are wrong?

Additional Material
Literature
5.1.4 An Overview of Query Optimization in Relational Systems

What is a physical operator? What is the algebraic representation of a query? What is a cost estimation in the context of the query optimizer? What type of search space did System-R use? What must be considered when applying these transformations? What is an interesting order? Do bushy join sequences require materialization of intermediate relations? When would they not?

Material
Literature
Additional Material
Literature

5.2 Cost-based Optimization

5.2.1 Cost-Based Optimization, Plan Enumeration, Search Space, Catalan Numbers, Identical Plans

Why is join order important? Which other binary operators may be affected by this problem? What is the overall idea in cost-based optimization? What is a left-deep tree? What is the number of possible left-deep plans for n input relations? What is a bushy tree? What is the number of possible bushy plans for n input relations? What do the Catalan numbers have to do with this? What is the assumption for the different plans when counting bushy plans?
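The plan-counting formulas can be checked with a few lines of Python (function names are mine): left-deep trees are counted by permutations of the relations, bushy trees by permutations times the Catalan number of binary tree shapes:

```python
from math import comb, factorial

def catalan(n):
    # closed form: C(n) = binomial(2n, n) / (n + 1)
    return comb(2 * n, n) // (n + 1)

def left_deep_plans(n):
    # left-deep trees: one plan per permutation of the n relations
    return factorial(n)

def bushy_plans(n):
    # n! leaf orders times C(n-1) binary tree shapes
    return factorial(n) * catalan(n - 1)
```

Already for a handful of relations the bushy search space explodes (e.g. 120 plans for 4 relations vs 24 left-deep plans), which is why enumerating and pruning, rather than exhaustively costing, becomes necessary.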

Additional Material
Literature
Further Reading
5.2.2 Dynamic Programming: Core Idea, Requirements, Join Graph

What is the join graph? What is the core idea of dynamic programming? What are the two requirements such that dynamic programming makes sense? What is an optimal subplan?

Additional Material
Literature
Further Reading
5.2.3 Dynamic Programming Example without Interesting Orders, Pseudo-Code

How does dynamic programming work in principle when applied to the join enumeration problem? What does the pruning step do? How many different entries does the table have at each iteration step? What does the mergePlan()-method do? How would it merge two subplans?

Additional Material
Literature
Further Reading
5.2.4 Dynamic Programming Optimizations: Interesting Orders, Graph Structure

What is an interesting order? Why do we have to change the simple dynamic programming algorithm to consider interesting orders? When exactly would we change the algorithm? Why would we exploit the join graph in dynamic programming? What effect does this have?

Additional Material
Literature
Further Reading

5.3 Query Execution Models

5.3.1 Query Execution Models, Function Calls vs Pipelining, Pipeline Breakers

How to execute a plan? How to translate a plan to an executable plan? Why not use function libraries? What is the problem with intermediate results? What is the core idea of pipelining? What type of pipelining are we talking about here? What is a pipeline breaker? And why would that matter in query processing? What are the three stages of simple hash join/index nested-loop join? What happens in each stage? What are the three stages of quicksort? What happens in each stage? What are the three stages of external merge sort? What happens in each stage? For all of these algorithms: which of those stages actually block the pipeline? What are the blocking and non-blocking building blocks of query pipelines?

Additional Material
Literature
5.3.2 Implementing Pipelines, Operators, Iterators, ResultSet-style Iteration, Iteration Granularities

How do we implement a pipeline, i.e. how do we get from the high-level idea of pipelining to a concrete implementation of pipelining? What is the core idea of the operator interface? What's next()? What happens in a hasNext()-call in an iterator? What is a chunk? Which special cases of chunks are important? What would be a simple, textbook-style translation of a plan using operators? What is ResultSet-style iteration? Is this restricted to rows or can we use it for columns as well? Would it make sense to iterate over pages?
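Python generators give a compact stand-in for the next()-style operator interface: each operator pulls one tuple at a time from its child, so nothing materializes intermediate results. A minimal sketch (operator names and the example plan are mine):

```python
def scan(table):
    for row in table:
        yield row                        # produce one tuple per next()-call

def select(pred, child):
    for row in child:
        if pred(row):                    # non-blocking: tuples stream through
            yield row

def project(cols, child):
    for row in child:
        yield tuple(row[c] for c in cols)

table = [(1, "Alice", 30), (2, "Bob", 20), (3, "Carol", 25)]
# plan: project name of all rows with age >= 25
plan = project((1,), select(lambda r: r[2] >= 25, scan(table)))
result = list(plan)                      # the root pulls tuples through
```

A blocking operator (e.g. sort) would instead have to consume its entire child before yielding its first tuple, which is exactly what makes it a pipeline breaker.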

Additional Material
Literature
5.3.3 Operator Example Implementations

How do we implement specific operators without unnecessarily breaking the pipeline? Is it hard to implement a projection or selection operator? How do we handle loops in operator implementations? What do we mean by "state" here?

Additional Material
Literature
5.3.4 Query Compilation

What is the performance problem with operators? How is the pipeline organized when compiling code? How do algebraic expressions translate into code? How to translate entire plans into LLVM/C++ code?

Material
Literature
Additional Material
Literature
Code Example
Code Example Slides
5.3.5 Anti-Projection, Tuple Reconstruction, Early and Late Materialization

What is early materialization? What is the relationship to column and vertically partitioned (column grouped) layouts? How would we do early materialization partially? What is an anti-projection? How could we implement early materialization in a column store? How could we implement late materialization in a column store? What is the impact on join processing? What is a join index? What is the impact on query planning?


6.1 Core Concepts

6.1.1 Crash Recovery, Error Scenarios, Recovery in any Software, Impact on ACID

What are possible error scenarios? What could be the impact of a power failure on the database buffer? What does the term "recovery" mean? Is recovery just important for database systems? What is the impact of error scenarios on ACID? What is a local error? How may it occur? What is redo? What is undo? What does losing main memory mean? What if we lose (some) external storage? How do we cope with this?

Additional Material
Literature
6.1.2 Log-Based Recovery, Stable Storage, Write-Ahead Logging (WAL)

What is "stable storage"? What is its relationship to the database store? What is the relationship to logging (as of video 14.163)? What is write-ahead logging (WAL)? What is and what is not allowed following WAL? What does WAL imply when committing transactions? What does WAL imply for the eviction of dirty pages from the database buffer?

Additional Material
Literature
6.1.3 What to log, Physical, Logical, and Physiological Logging, Trade-Offs, Main Memory versus Disk-based Systems

What is physical logging? What are its pros and cons? What is an after image and a before image? And how big may those images become? What is logical logging? How does logical logging differ from physiological logging? What do log entries for the three variants typically look like? What are the performance trade-offs for the different logging variants? In a disk-based system? In a main-memory system? Why do the trade-offs differ so much in the two types of systems? How would you log data in a main-memory system? What is the relationship to dictionary compression?

Additional Material
Literature

6.2 ARIES


What are the three phases of ARIES? What is the purpose of the transaction table (TT)? What is the lastLSN? What is the purpose of the dirty page table (DPT)? What is the recoveryLSN? Which assumptions do we make in ARIES-recovery? What information is contained in a log record? In any log record? In update log records? What is the redo information and undo information? What is the relationship of LSNs and database pages in the store? How is TT maintained during logging? How is DPT maintained during logging? What is the general idea of a compensation log record (CLR)? What is the redoTheUndo information contained in it? Where does undoNextLSN point to? How do we shorten (prune) a log file? How do we shorten the analysis phase? How do we shorten the redo phase? How do we shorten the undo phase? How do we determine firstLSN? Where do we cut off the log? How exactly is a fuzzy checkpoint written to the log file? What are the core tasks of the three phases of ARIES? What does "repeating history" mean in this context?

Material
Literature
Additional Material
Literature
Further Reading

7.1 Introduction to NoSQL


What is a key-value database? What is an aggregate-oriented database? What is a column family? What is the value in a document database? What is the relationship of these aggregates and clustering, i.e. like in clustered indexes? When is an aggregate-oriented database a disadvantage? Are graph databases aggregate-oriented? Are relational databases good at managing relationships? What is BASE? What is the relationship of aggregates and transactions in aggregate-oriented databases? What is an offline lock? What is logical consistency? What is replication consistency? Is handling consistency a business choice? What is a network partition? In case of a network partition, what does this imply for consistency and availability (following Fowler's explanation of CAP)? What is the trade-off between consistency and response time in a distributed system? Who should use NoSQL and what does that have to do with 64K?



7.2 Introduction to Big Data


What is big data? What are applications of big data? What are volume, variety, velocity, and veracity? What do they mean? Is big data always big?



7.3 Introduction and Recap of MapReduce


What are the three different MapReduces? What are the semantics of the map() and reduce()-functions? What are the different types of Hadoop? What is HDFS? How is data stored on HDFS? How is a file partitioned and replicated? To which RAID-level does this correspond? Why would partitioning help for load balancing? What are the three phases of MapReduce? What is a mapper? What is a split? Where is the output of the map()-calls stored? Is the data of all copies mapped in the map phase? What does the shuffle phase do? How could that be described in SQL? On what data is reduce() called? Where is the result of the reduce phase stored?

Additional Material
Literature
Additional Recap Material (from undergrad lecture, in German)
Literature
Video
Slides

