Worksheet on the ADEME dataset - Part B normalization

This worksheet follows part A of the work on the ADEME dataset, a collection of 10k energy diagnostic of building in France.

It's best to start with Part A to become familiar with the data and its quirks.

The Ademe dataset

ADEME is the French government agency for ecological transition, supporting sustainability, energy efficiency, and environmental innovation. The DPE dataset is available at this address https://data.ademe.fr/datasets/dpe-v2-tertiaire-2. This dataset only conerns the tertiary sector: services, administrations etc

Each builing is labeled with a letter A to G. A and B graded buildings show top level energy efficiency while F and G labeled buildings are called "passoire energetique" (energy guzzler).

As often the case in real world situations the dataset is far from being perfect.

In part B, we work on a subset of 5.6k samples out of 600k+ samples contained in the complete dataset.

The database so far

At this point, the database ademedb has only one table called dpe with 32 columns. In part A we removed the rows that were useless. Leaving us with 5608 rows.

Your mission is to normalize the databae and prepare it for production use.

Load the db

The database should be laoded yet, but you can restore the dump file available on github. Choose the ademe_backup_01.dump file.

The data can be restored with pg_restore --no-owner --no-acl --clean --if-exists --dbname=ademedb ademe_backup_01.dump

If you already have the database availbe, and you';ve done part A of the quiz, you don't have to restire the database.

Student Information

Email *

Please provide a valid email.

Name *

Please provide your name.

I accept the privacy policy and terms *

You must accept the privacy policy to continue.

A refresher on normal forms

We saw the goal of normalizing a database is

to eliminate data redundancy and ensure data consistency by organizing data into well-structured tables with clear relationships

or simply put: Normalization helps reduce data chaos by making sure each piece of information is stored exactly once, in the right place.

There are 3 rules which act as guidelines : 1NF, 2NF, 3NF. Let's refresh your memory on 1NF.

Which of the following statements about First Normal Form (1NF) are true?

A) Each attribute must contain atomic (indivisible) values - no arrays, sets or nested structures

B) Columns must be ordered alphabetically to achieve First Normal Form

C) There should be no repeating groups of columns (like item1, item2, item3)

D) All text values must be stored in uppercase or CamelCase to maintain data atomicity

Normalize energy consumption

The 6 following columns are in obvious breach of 1NF.

final_energy_cons_01
final_energy_cons_02
final_energy_cons_03

and

primary_energy_cons_01
primary_energy_cons_02
primary_energy_cons_03

It would be so much better, readable and concise to have all these columns regrouped in just table energy_consumption.

In the following, we will normalize the dpe table by moving these 6 columns in a new energy_consumption tables.

Let's start by creating the new energy_consumption table:

with the following columns: - id: the primary key (serial) - dpe_id: the id of the dpe row. - a priority level: 1, 2, 3 - a measure_type : 'final' or 'primary' - the actual consumption_value

Let's deal with the final_energy_cons columns first

The idea is to select from each column, transforming the column name into the proper columns of energy consumption table and the use UNION ALL to aggregate all the results for all the columns.

For instance, with the column final_energy_cons_01 : sql SELECT id as dpe_id, 1 as priority, 'final' as measure_type, final_energy_cons_01 as consumption_value FROM dpe WHERE final_energy_cons_01 > 0

Write a CTE where the subquery is a UNION ALL of all these column dependant queries and the main query of the CTE will be a select * from the subquery.

Remember that the missing values are not null but = 0. We don't want to import them.

Anyway, enough rambling, good luck!

At the end of everything you should have 17100 rows in the energy_consumption table

When you're done, please write the CTE that imports the 6 values from the 6 columns into the energy_consumption table.

How many buildings have no energy consumption?

write a query that returns the list of buildings that have no record in the energy_consumption table.

How many do you find ?

832

2235

How many buildings have no energy consumption?

You can now drop the 6 columns:

finalenergycons_01
finalenergycons_02
finalenergycons_03
primaryenergycons_01
primaryenergycons_02
primaryenergycons_03

write the sql query (ALTER TABLE ...) that actually drops these columns the the table dpe

What about the remining denormalized columns ?

we can export all the other energytype, energyusagetype, annualenergycost andd energyreading_year columns to a new table called energy_use and all the ban related columns to a ban table.

You can do it if you want to or just grab the SQL queries available on github and fast forward

What are the labels definitions ?

Our question now is whether the DPE and GHG labels are reliable.

The classification rules from A to G are defined in this image: DPE seuils etiquettes

Let's create a table that references these thresholds

Run the following queries CREATE TABLE energy_labels ( grade CHAR(1) PRIMARY KEY, min_cons NUMERIC, max_cons NUMERIC, emission_min NUMERIC, emission_max NUMERIC );

and add the values INSERT INTO energy_labels (grade, min_cons, max_cons, emission_min, emission_max) VALUES ('A', 0, 70, 0, 6), ('B', 71, 110, 7, 11), ('C', 111, 180, 12, 30), ('D', 181, 250, 31, 50), ('E', 251, 330, 51, 70), ('F', 331, 420, 71, 100), ('G', 421, NULL, 101, NULL);

Are the labels reliable ?

Now use these official rules to analyze the vericity of the labels in the database.

Check the actual values of energy_use_kwhepm2year and ghg_emissions_kgco2m2year to detect - how many DPE labels are good - how many DPE labels are slightly off by 1 label (B instead of A, E instead of F or D, ..) - how many DPE labels are off by more than 1 label (B instead of A, E instead of F or D, ..)

Write down your findings and analysis method below