Worksheet on the ADEME dataset

In this worksheet, you work with the ADEME dataset, a collection of 10k energy diagnostics of buildings in France. Each building is labeled with a letter from A to G. A- and B-graded buildings show top-level energy efficiency, while F- and G-labeled buildings are called "passoires énergétiques" (energy guzzlers).

ADEME is the French government agency for ecological transition, supporting sustainability, energy efficiency, and environmental innovation. The DPE dataset is available at https://data.ademe.fr/datasets/dpe-v2-tertiaire-2. This dataset only concerns the tertiary sector: services, administrations, etc.

The whole dataset includes over 600k energy audits, but we will only work on a subset of 10k samples.

This worksheet has multiple parts and will cover most of the course content, including:

  • normalization: 1NF, 2NF, 3NF
  • index creation: B-tree and hash indexes
  • optimizing queries with EXPLAIN
  • window functions and CTEs
  • SQL and PL/pgSQL functions

Note: you do not have to answer all the questions. If you don't know an answer, skip ahead.

As is often the case in real-world situations, the dataset is far from perfect.

Your mission is to understand the data, improve it and prepare it for production use.

Student Information


Email:

Name:

Part A: Let's get started

In the first part of the project, you load the dataset into a newly created ademedb database.

The dataset is available as a postgresql dump on github at https://github.com/SkatAI/epitadb/blob/master/data/ademe_backup_01.dump. Download it.

This database has just one table, called dpe, with over 50 columns and a lot of null values.

Connect to the PostgreSQL server, preferably with psql in the terminal, and connect to the postgres database.

Use postgres as the user if you are on Windows, or the default system user if you're on a Mac.
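
For example, a terminal session might start like this (the -U flag corresponds to the Windows case; adjust to your setup):

```shell
# Connect to the default postgres database as the postgres user
psql -U postgres -d postgres
```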

We start by reviewing how to manage the server.

1. Check that your PostgreSQL server is running

Write down the command line that verifies whether the server is running.

Also copy-paste the command output.
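
If you're stuck, here are a few common candidates; which one applies depends on how PostgreSQL was installed, and the service names below are typical defaults, not guaranteed:

```shell
# Client-side check: is the server accepting connections?
pg_isready -h localhost -p 5432

# macOS with Homebrew
brew services list

# Linux with systemd
systemctl status postgresql
```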

2. Can you restart the server?

Write down the command line that restarts the server, and paste the command output.
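
The restart command also depends on your installation; these are typical forms, where the version suffix and data directory are assumptions to adapt:

```shell
# macOS with Homebrew (adjust the version suffix)
brew services restart postgresql@16

# Linux with systemd
sudo systemctl restart postgresql

# Directly with pg_ctl (adjust the data directory)
pg_ctl restart -D /usr/local/var/postgres
```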

3. List the databases that are in your server

In a psql session, use the \l command and paste the output.

Note that there's no expectation of an exact list of databases; it's more to verify that you can actually list them.

4. List the tables in the _postgres_ database

While connected to the postgres database, list the tables.

Copy-paste the raw output of your command below.

5. Create the database and load the data

We now create a new database ademedb. There are multiple ways to do it. Which ones are valid?
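
Candidates to evaluate include the SQL command and its command-line wrapper:

```sql
-- From a psql session connected to the postgres database:
CREATE DATABASE ademedb;
-- Or, from the terminal (shell, not SQL): createdb ademedb
```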

6. Load the data

The database ademedb is now created (check with \l).

The data dump is available from GitHub.

Download the file if you haven't already.

This PostgreSQL dump file was created with: pg_dump -h localhost --no-owner --no-acl --clean --if-exists --format=custom --compress=9 --file=ademe_backup.dump ademedb

You can find the meaning of the different flags in the pg_dump documentation.

What is the equivalent pg_restore statement that you can use to restore the database?
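
As a hint, a pg_restore call mirrors most of the pg_dump flags; one plausible form, assuming the dump file sits in the current directory and ademedb already exists, is:

```shell
# --format and --compress are dump-time flags; pg_restore detects the
# custom format on its own
pg_restore -h localhost --no-owner --no-acl --clean --if-exists --dbname=ademedb ademe_backup_01.dump
```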

Part B: Let's explore the data

You should now have loaded the data in the ademedb database.

As you can see, we have only one table, with over 50 columns.

In this section you explore the dataset through queries and functions.

We start by dealing with null values and then move on to detecting outliers.

IMPORTANT: numeric columns use a 0 value for nulls. Text columns may contain NULL or the empty string '' for missing values.

7. How many `dpe_label` values are null?

The dpe_label is what justifies this whole dataset. If it's missing there's no point in keeping the record.

How many dpe_labels have null values?

Write a query to count the number of rows with null values for dpe_label. Paste the number, not the query.
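
If you're stuck, a counting query following the null convention above could look like this sketch:

```sql
-- dpe_label is a text column, so check both NULL and the empty string
SELECT COUNT(*)
FROM dpe
WHERE dpe_label IS NULL OR dpe_label = '';
```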

8. Is `ghg_label` ever non-null when `dpe_label` is null?

My guess is that when the dpe_label is null, the ghg_label is also null, and vice versa.

Is that true?

Yes or no?
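
One way to verify the guess is to count the mismatches in both directions with FILTER; this sketch assumes both label columns are text:

```sql
-- Both counts are 0 if and only if the two labels are always missing together
SELECT
    COUNT(*) FILTER (WHERE (dpe_label IS NULL OR dpe_label = '')
                       AND ghg_label IS NOT NULL AND ghg_label <> '') AS ghg_only,
    COUNT(*) FILTER (WHERE (ghg_label IS NULL OR ghg_label = '')
                       AND dpe_label IS NOT NULL AND dpe_label <> '') AS dpe_only
FROM dpe;
```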

9. Remove the null `dpe_label` rows

Write and execute the query that removes all the rows where dpe_label is null.

Paste your query
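
The deletion itself is a one-liner; this sketch follows the same null convention as above:

```sql
DELETE FROM dpe
WHERE dpe_label IS NULL OR dpe_label = '';
```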

10. Write a function to check the other columns

At this point you should have 5608 rows, and all the rows have a non-null label. But there are probably other columns that have null values. We don't want to modify and run a query for each of the 53 columns.

Let's write a PL/pgSQL function that takes as argument:

  • table_name
  • column_name

and returns:

  • the number of rows that are null, or 0 (for numeric and integer types), or the empty string '' (for text types)
  • the percentage of null values over all the rows

Let's call it check_nulls(table_name, column_name)

Detect the column type: check out the information_schema.columns table to find the type of a column.

For instance: SELECT column_name, data_type FROM information_schema.columns WHERE table_name = 'dpe';

hint: you can use COUNT with FILTER as such COUNT(*) FILTER (WHERE ' || column_name || ' IS NULL)

paste your function.
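
If you need a starting point, here is one possible shape for the function. It is a sketch, not the only valid answer: it assumes the table lives in the public schema, groups types coarsely, and ignores the empty-table edge case:

```sql
CREATE OR REPLACE FUNCTION check_nulls(p_table text, p_column text)
RETURNS TABLE(null_count bigint, null_pct numeric) AS $$
DECLARE
    col_type  text;
    predicate text;
BEGIN
    -- Find the declared type of the column
    SELECT data_type INTO col_type
    FROM information_schema.columns
    WHERE table_name = p_table AND column_name = p_column;

    -- Build the "is missing" predicate depending on the type
    IF col_type IN ('smallint', 'integer', 'bigint', 'numeric', 'real', 'double precision') THEN
        predicate := format('%I IS NULL OR %I = 0', p_column, p_column);
    ELSIF col_type IN ('text', 'character varying', 'character') THEN
        predicate := format('%I IS NULL OR %I = ''''', p_column, p_column);
    ELSE
        predicate := format('%I IS NULL', p_column);
    END IF;

    -- Dynamic query: count missing rows and their percentage
    RETURN QUERY EXECUTE format(
        'SELECT COUNT(*) FILTER (WHERE %s),
                ROUND(100.0 * COUNT(*) FILTER (WHERE %s) / COUNT(*), 2)
         FROM %I', predicate, predicate, p_table);
END;
$$ LANGUAGE plpgsql;
```

Usage: SELECT * FROM check_nulls('dpe', 'dpe_label');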

11. List the number and percentage of null values for all columns

Getting the list of columns from the information_schema.columns table, write a CTE to output

  • column name
  • number of null rows (or 0 or '')
  • percentage of null rows

The subquery gets the column name and calls the check_nulls() function.

paste the CTE query
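
A LATERAL call against the function is one way to wire this up; a sketch, assuming the check_nulls() function from the previous question exists:

```sql
-- One row per column of dpe, with its null count and percentage
WITH cols AS (
    SELECT column_name::text AS column_name
    FROM information_schema.columns
    WHERE table_name = 'dpe'
)
SELECT c.column_name, n.null_count, n.null_pct
FROM cols c,
     LATERAL check_nulls('dpe', c.column_name) AS n
ORDER BY n.null_pct DESC;
```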

12. How many columns have over 80% null values?

Looking at the output of the previous CTE, how many columns have more than 80% null values?

13. Reindex the id

Since we've removed nearly half the rows from the original database, the values for the primary key are no longer sequential.

Write and apply the query to reindex the id values so that they start at 1 and increment one by one.

hint: use row_number() OVER

paste the query, and apply it
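
If you're stuck, one sketch goes through temporary negative values, because a plain one-step UPDATE can collide with the primary key's uniqueness check mid-statement. It assumes the primary key column is called id:

```sql
-- Step 1: assign the new sequential numbers, negated to avoid
-- transient duplicate-key violations
WITH numbered AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS new_id
    FROM dpe
)
UPDATE dpe
SET id = -numbered.new_id
FROM numbered
WHERE dpe.id = numbered.id;

-- Step 2: flip back to positive
UPDATE dpe SET id = -id;
```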