Saarland University Saarland University | Department of Computer Science

Infosys Group

Database Systems SoSe 2011



(18-10-11) Busy Beaver Award for best advanced lecture in computer science.

We are flooded with data be it data on the Web (html pages, twitter, facebook, map services, ...), structured data in databases (your money on bank accounts, addresses, cell phone data, school and uni grades, flight information, taxes, medical records, ...), or data in scientific applications (gene data in bioinformatics, telescope data in astronomy, collider data in physics, measurements of seismic activity in geology, ...).

The way we access, manage, and process that data has tremendous impact on:

  1. performance. Though we sometimes think that a performance problem is due to particular algorithm requiring too much CPU time, it is sometimes the data access patterns and retrieval times that slow down a program. The reason for bad performance may be that data cannot be accessed and shipped fast enough to the CPU. For instance you may be using unsuitable access methods (i.e. indexes) to retrieve a single piece of information from a large data repository. Or you might be using an inefficient data layout ignoring the memory hierarchy and hardware capabilities of modern processors. In addition, even if the data was efficiently retrieved, performance may suffer due to picking the wrong analytical algorithms.
  2. reliability. What happens if your hard disk fails or your data center is flooded with water? How do you make sure that a consistent version of your data is accessible at all times? Can you afford to lose all data? How do you exploit multi-threading for accessing data without corrupting you data repository?

If you are interested in these questions, this might be the right lecture for you.

In this core lecture you will learn how to answer these questions. You will learn fundamental data managing algorithms and techniques which are used to build (not only) database systems but also search engines (Google), file systems, data warehouses, publish/subscribe systems (like Twitter), streaming systems, map services (google maps), or Amazon's Cloud (EC2), etc.

These techniques and algorithms will allow you to design, plan, and build (almost) any kind of data managing system.

Topics will include:

  • storage media (tape, disk, flash, main memory)
  • data managing architectures (DBMS, streams, file systems, clouds, appliances)
  • storage management (DB-file systems, raw devices, write-strategies, differential files, buffer management)
  • data layout (horizontal and vertical partitioning, columns, hybrid mappings, compression, free memory management, defragmentation)
  • indexing (one- and multidimensional, tree-structured, hash-, partition-based, bulk-loading and external sorting, differential indexing, LSM and stepped merge trees, read- and write-optimized indexing, data warehouse indexing, text indexing, main-memory indexes, sparse and dense, direct and indirect, clustered and unclustered, main memory versus disk and/or flash-based)
  • operator models (push- and pull-model, block-based, vectorized)
  • operator implementations (join algorithms for relational, spatial, and multidimensional data, grouping and early aggregation, filtering)
  • query processing (scanning, plan computation, SIMD)
  • query optimization (query rewrite, cost models, cost-based optimization, join order, plan enumeration)
  • data recovery (single versus multiple instance, logging, ARIES, hot stand-by)
  • parallelization of data and queries (horizontal and vertical partitioning, shared-nothing, replication, distributed query processing, 2PC, map-reduce and hadoop)
  • read-optimized system concepts (search engines, data warehouses, OLAP, ad-hoc analytics)
  • write-optimized system concepts (OLTP, streaming data, moving objects)
  • management of geographical data (GIS, google maps)
  • management of high-dimensional data (OLAP, multi-dimensional indexing trade-offs)

Place and Time:

  • MPII 0.24
  • Wed and Fri from 10:15 to 12:00
  • Further information in HIS

Rules:

  • Requirements: Students are expected to have successfully passed the Information System introductory lecture held every summer semester covering introductory relational database management concepts or a comparable lecture.
  • Assignments: There will be weekly assignments (about 10 in total). Your performance in these assignments is not part of your overall grade. However, you have to obtain 50% of the points of all assignments overall. You may not have more than 3 assignments with 0 points (due to not handing-in, due to not presenting successfully, or due to not solving anything, ...).
    We strongly recommend that you team-up to build a learning group. You may hand-in as a group of 2-3 people. We expect one write-up per group. At the same time we expect that each member of the group contributes and understands each solution. If for whatever reason in any particular week you could not contribute to a particular question on an exercise sheet, it is fine to mark that on your hand-in. In that case you will not receive any points for that particular question, however you may receive points for all other questions. However, you do not risk to lose all points for this hand-in due to failed presentations (see Exercise Groups).
  • Exercise Groups: We highly recommend to participate in the exercise groups regularly in order to pass the lecture. However, attendance is not compulsary (with exceptions, see below). The material covered and hints given in the exercise groups will be indispensable to successfully pass the exams. Think of the exercises as mini-exams (even though they are not officially exams). The exercises will be very close to what you have to solve in the exams anyway.
    You have to successfully present the solution to one of the exercises at least once throughout the semester. If you cannot present what you handed-in (in other words: if the TA has serious doubts that you understand what you are presenting even though you handed-in a correct solution...), you will not receive any points for that exercise sheet in that week. In that case you have to try again to successfully present another exercise in one of the three following weeks and must attend all three following tutorials. Similarly, if you do not show up at the exercise groups frequently, the TA may make attendance for you compulsary to enable him to randomly pick you for presenting eventually. This will allow for a fair evaluation of your exercise performance.
  • Project: You will be grouped into small teams of 2-3 students and implement a small project throughout the semester: a main-memory database system. You may use only Java for coding plus a set of pre-defined libraries (see project definition).
    There will be three milestones (Dates: MS1: 18.5.; MS2: 15.6.; MS3: 13.7. at 23:59 each, all milestones postponed by 3 days compared to the initial announcement) plus a final hand-in (Date: 7.8. at 23:59) throughout the semester.
    For each milestone there will be a number of functional and performance tests. The outcome of the functional tests define 3% of your milestone grade, the performance test another 3%, i.e. in total 6% for each milestone. Notice that performance tests for your program will only be executed in case the corresponding functional test does not produce any errors. Performance tests for the three milestone will be executed after the final hand-in only. However, after each milestone we will give you feedback already on ther performance of your solution. The final project hand-in defines another 12%, i.e. in total 6+6+6+12 = 30%.
    In case you do not hand-in a milestone or the final project on time, you do not receive any points for that milestonne. Late hand-in will not be considered. Please do not ask for extensions. Rather try to hand-in whatever you have than not handing-in at all. Only SVN hand-ins will be considered. Please do not send your results by email. The final hand-in of your project includes a brief oral presentation where each team member is expected to briefly show what she or he contributed to the project. You need to pass this evaluation to pass the lecture.
    Yet there will be individual project grades for the team members. Overall you have to obtain 50% of the points in the project. Then the project counts 30% of your overall grade.
    We will provide you with an svn-repository plus an automatic testing and performance evaluation environment.
  • Exams: there will be three exams: one 60 minute mid term exam (Date: May 28, 11.30am-12:30pm), one 120 minute end term exam (Date: July 27, 10:15am-12:15pm), and one 120 minute repetition exam (Date: Oct 5, 10:15am-12:15pm). The midterm counts 20% of your overall grade. There is no minimal amount of exercise points you need to obtain in order to participate in the midterm exam. In addition, there is no minimal amount of points you have to obtain in the midterm exam. However, you cannot repeat the midterm exam.
    In contrast, you need to obtain at least 50% of the points in the exercises in order to be allowed to participate in the final or repetition exam. You need to pass either the final or repetition exam. You need to obtain at least 50% of the points in one of these exams. The best grade of your performance in the final and repetition exam grade determines 50% of your overall grade for this course.
  • Overall Grade: 20% midterm + 50% max(final exam, repetition exam) + (6+6+6+12)% project .
  • Requirements for passing:
    • 50% of points in exercises,
    • one successful presentation in exercise group,
    • 50% of points in project,
    • 50% of points in either final or repetition exam.

Lecture Notes:

  • July 15: Lecture 25 Pass 3: Current Research in Databases (continued), Third Pass, what's next?
  • July 13: Lecture 24 Pass 2: Pig Latin, Pig, Current Research in Databases
  • July 8: Lecture 23 Pass 2: Hadoop++ (continued)
  • July 6: Lecture 22 Pass 2: MapReduce (continued), Hadoop, Hadoop++
  • July 1: Lecture 21 Pass 2: parallelization of data and queries (continued), MapReduce (basics)
  • June 29: Lecture 20 Pass 2: crash recovery (continued), crash recovery in main memory DBMS, project discussion, parallelization of data and queries
  • June 24: Lecture 19 Pass 2: crash recovery (continued), ARIES, 3 Phases; franklin article (Section 2.2 and 3.2 are about recovery)
  • June 22: Lecture 18 Pass 2: starting second pass, possible topics, recap of transaction management, crash recovery
  • June 17: Lecture 17 Pass 1: join algorithms (continued), aggregation and grouping, join order, MS3 guidelines and discussion
  • June 10: Lecture 16 Pass 1: join algorithms (continued), simple hash join, grace hash join, double-pipelined hash join; evaluation result
  • June 8: Lecture 15 Pass 1: external sorting (continued), replacement selection, join algorithms
  • June 3: Lecture 14 Pass 1: pipelining and operators (continued), pipelining execution models, external sorting (basic idea)
  • June 1: Lecture 13 Pass 1: overview on query optimizer (continued), rule-based optimization, access paths, cost-based optimization (overview), pipelining, operators
  • May 27: Lecture 12 Pass 1: extendible hashing (continued), linear hashing, when to use which index, overview on query optimizer
  • May 25: Lecture 11 Pass 1: static hashing (continued), order preserving hashing, page-oriented hashing, dynamic hashing
  • May 20: no lecture due to illness
  • May 18: Lecture 10 Pass 1: brief recap compression in main memory, differential indexing, log-structured indexing, evaluating performance, static hashing
  • May 13: Lecture 9 Pass 1: the many faces of B+-trees (continued), read-optimized trees, trees versus partitioning, B-trees in main memory
  • May 11: Lecture 8 Pass 1: addressing rows, data structures/indexes, the many faces of B+-trees, ObjectTest.java [13.5.11] updated: improved colors (leave nodes) to be consistent with May 13 lecture
  • May 06: Lecture 7 Pass 1: more on compression, addressing pages, virtual memory management
  • May 04: Lecture 6 Pass 1: column layouts, implicit versus explicit, fractured mirrors, vertical partitioning, partial redundancy, PAX, compression
  • April 29: Lecture 5 Pass 1: writing, differential files, logged writes, row layouts
  • April 27: Lecture 4 Pass 1: more RAID, data redundancy, main memory, reading and buffering data
  • April 20: Lecture 3 Pass 1: hard disks, SSDs, RAID
  • April 15: Lecture 2 Pass 1: administrative questions, recursive organization of this lecture, hardware
  • April 13: Teaser: Why are databases so cool?, History of DBMS, Structure of this lecture

Recommended Literature:

Exercises:

  1. Exercise Groups
    • (14.04.11) If you registered and received a confirmation link "https://islecture.cs.uni-saarland.de/database11//confirm.php?id=something" please change "database11" to "dbms11". Then it will work. If you have not registered yet, just register. The system will send the correct link.
    • If you want to participate in this lecture, please register at our lecture system (all students, including Erasmus and Wirtschaftsinformatik). Please mark at least two exercise groups as "possible" or "optimal". Not all of these exercise groups will be offered. We will offer exercise groups to best fit your choices. Lecture System
    • Dr. Jorge Quiane (Chief TA)
    • Stefan Schuh
    • Mathias Bader
  2. Assignments:

  3. Sample Solutions:

Exams:

  • (01.06.11) midterm exam inspection on June 9, seminar room of IS group, E1 1, 2nd floor. If your last name starts with A-M -> 11:00am till noon. If your last name starts with N-Z -> noon to 1:00pm.

Project: