DBA StackExchange Assignment

Use case
Stack Exchange is a network of websites where people can ask (and answer) questions about various subjects, usually of a technical nature. The best known of these sites is StackOverflow.

The data used by these websites can be freely accessed. Stack Exchange stores its data in a relational database. Because of the large number of users, this database is very large: the complete database is about 60 GB when compressed.

For this assignment you will work with the data of DBA StackExchange, the site for database administration questions. We have reduced this dataset to approximately 180 MB.

The goal of this assignment is to normalize the database and to write your own queries to retrieve the data for the Stack Exchange website. You will optimize the database structure to ensure the highest possible performance.

One of the tables contains the votetypeid attribute. The values in this column have the following meaning:

1 = AcceptedByOriginator
2 = UpMod
3 = DownMod
4 = Offensive
5 = Favorite
6 = Close
7 = Reopen
8 = BountyStart
9 = BountyClose
10 = Deletion
11 = Undeletion
12 = Spam
13 = InformModerator
Setup

To do this assignment you will need to use the postgres-docker Docker composition.

Start the Docker composition: docker compose up
Copy the file adm_stackexchange_2425.gz to the folder import-data (in the docker composition folder).
Create a new database by running the following command: docker exec -i postgres-adm psql -U adm -c "CREATE DATABASE admexam;".
Import the data by running the following command: docker exec -i postgres-adm sh -c 'gunzip -c /import-data/adm_stackexchange_2425.gz | pg_restore -U adm -d admexam'.
Importing the data takes a little while…
Afterwards you can access the database using the PgAdmin included in the composition. For this to work, you’ll first need to register the database server in PgAdmin. See the docker-compose file for the hostname and credentials.

Assignment

Caution
After every structural change to the database (adding constraints, indexes, etc.), you will need to update the statistics used by the query planner by running the query: VACUUM ANALYZE.
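
For example (the table and constraint names here are placeholders, not the actual schema):

ALTER TABLE users ADD CONSTRAINT users_pkey PRIMARY KEY (id);  -- hypothetical structural change
VACUUM ANALYZE;                                                -- refresh the statistics afterwards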

Part 1: Normalization
The current database is not normalized yet. It is your task to analyse the current database and come up with a proposal on how to normalize the database. Create an ERD for the normalized database and explain your design choices. Write the necessary SQL statements to transform the database into normalized form. Store your SQL script in 1_normalize.sql.
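
As an illustration only, one normalization step could look like the sketch below. The posts table and its comma-separated tags column are assumptions made for the sake of the example; the actual tables and columns depend on your analysis of the imported schema.

-- Hypothetical: move repeated tag strings out of posts into separate tables
CREATE TABLE tags (
    id   serial,
    name text NOT NULL
);

CREATE TABLE post_tags (
    postid integer NOT NULL,
    tagid  integer NOT NULL
);

INSERT INTO tags (name)
SELECT DISTINCT unnest(string_to_array(p.tags, ','))
FROM posts p
WHERE p.tags IS NOT NULL;

INSERT INTO post_tags (postid, tagid)
SELECT p.id, t.id
FROM posts p
JOIN tags t ON t.name = ANY (string_to_array(p.tags, ','));

ALTER TABLE posts DROP COLUMN tags;
-- (primary keys, foreign keys and indexes are added in parts 3 and 4)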

Part 2: Writing Queries
For each of the questions below, write down your query and perform the following operations:

Use the query plan to write your own explanation of how the query is executed.
Write down how long it takes to execute the query.
For the various pages of the website you are required to reconstruct their queries. The list below gives an overview of the queries that you have to write. Please write all your queries in a file called 2_queries.sql.
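
To obtain the query plan and the actual execution time, you can prefix the query with EXPLAIN ANALYZE; the query below is only a placeholder:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM posts WHERE id = 42;  -- placeholder query; substitute your own

In psql you can additionally enable \timing to show the client-side execution time of every statement.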

Questions
Searching: Search for questions or answers that have both ‘MySQL’ and any conjugation of the verb ‘to run’ in the title and/or in the text (a possible approach is sketched after this list).
Tags overview: Show all tags and per tag the number of times the tag has been used in a question. Sort the result descending by number of times used.
Users overview: Show all user information per user. Show also for each user the number of questions, number of answers, number of comments and the number of tags used.
Questions overview: Show all questions, sorted by date descending. For each question show the ID, the first 200 characters of the contents, the vote score (number of upvotes minus number of downvotes), the number of answers to the question, and the number of comments on the question and on its answers.
Comments: Given a couple of post IDs, show the comments for these posts. Sort by post ID and also show the name of the author of the comment.
Adding answers: Add a new comment to the database on an existing answer.
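
For the searching question, one possible approach is a case-insensitive regular expression match. The table and column names below are assumptions; adapt them to your normalized schema.

SELECT p.id, p.title
FROM posts p
WHERE (p.title ~* 'mysql' OR p.body ~* 'mysql')
  AND (p.title ~* '\m(run|runs|ran|running)\M'
       OR p.body ~* '\m(run|runs|ran|running)\M');
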
Part 3: Creating Primary Keys and Foreign Keys
Write SQL statements that add primary and foreign keys to the tables using the ALTER TABLE command. Add all statements to the file called 3_pk_and_fk.sql.
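
For example (placeholder table and column names, adapt them to your own schema):

ALTER TABLE posts    ADD CONSTRAINT posts_pkey    PRIMARY KEY (id);
ALTER TABLE comments ADD CONSTRAINT comments_pkey PRIMARY KEY (id);
ALTER TABLE comments ADD CONSTRAINT comments_postid_fkey
    FOREIGN KEY (postid) REFERENCES posts (id);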

Now repeat the queries from part two, analyze the new query plans and clearly describe the observed changes.

Part 4: Creating Indices
PostgreSQL automatically creates an index for every primary key, but not for foreign keys. An index provides a speed boost when reading data, but it also generates additional overhead when inserting, updating or deleting data. It is therefore advisable to be careful when adding indices.

For this assignment we will look specifically at read operations, so add an index for every foreign key column. Store all SQL statements that add those indices in 4_indices.sql.
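
A foreign key index can be created like this (hypothetical table and column names):

CREATE INDEX comments_postid_idx ON comments (postid);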

Besides the foreign keys, there are more fields that could use an index. Explain which fields need to have an index and add these indexes. Make sure the SQL is stored in the 4_indices.sql script.
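
For example, a column that is frequently used for filtering or sorting might be a candidate (again a hypothetical name):

CREATE INDEX posts_creationdate_idx ON posts (creationdate);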

Now repeat the queries from part two, analyze the new query plans and clearly describe the observed changes.

Part 5: Denormalization and Triggers
The database we are using should now be normalized to third normal form: no data is stored twice. This does lead to rather complex queries, because data has to be gathered from many different tables. Combining all that data has an enormous impact on the performance of the database, which becomes especially apparent when the database is used mainly for reading and only rarely has to process writes.

A solution to this problem is denormalization: rather than calculating certain information on every query, as would be proper, we are going to store a cached value.

In order to avoid inconsistency, we are going to maintain this cached information with triggers. A trigger is a piece of code that is executed automatically whenever an INSERT, UPDATE or DELETE is performed.
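
As a sketch only, a cached answer count could be maintained as shown below. The posts and answers tables, the parentid column and the answercount column are hypothetical names, not taken from the assignment schema, and the UPDATE case is left out for brevity.

-- Hypothetical cached column on the question
ALTER TABLE posts ADD COLUMN answercount integer NOT NULL DEFAULT 0;

-- Trigger function that keeps the cached value consistent
CREATE FUNCTION update_answercount() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE posts SET answercount = answercount + 1 WHERE id = NEW.parentid;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE posts SET answercount = answercount - 1 WHERE id = OLD.parentid;
    END IF;
    RETURN NULL;  -- return value of an AFTER trigger is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER answers_changed
AFTER INSERT OR DELETE ON answers
FOR EACH ROW EXECUTE FUNCTION update_answercount();  -- PostgreSQL 11+ syntax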

Analyse your database and explain which fields from the database should be denormalized (and why). Write SQL to transform your database into the desired form:

5a_denormalization.sql: contains the ALTER TABLE statements that you need to update the structure of the tables.
5b_insert.sql: contains the INSERT or UPDATE statements that you will need to fill the newly added columns with the right values.
5c_triggers.sql: contains the code that you will need to keep the data consistent in the database. The database should always be consistent after INSERT, UPDATE or DELETE operations.
Rewrite the queries of part 2, so that the newly cached fields are used. Store these queries in 5d_new_queries.sql. Store the query plans and compare to the earlier query plans.

What to hand in?
A short report containing:
An ERD of the normalized database.
An ERD of the optimized database (part 5).
Query plans and measurement results for the different parts of this assignment, including explanations and a conclusion.
All .sql files of the different parts.
Test if the database can be transformed from the initial state to your final state by executing the .sql files in sequential order.