Worksheet on the ADEME dataset - Part B: normalization
This worksheet follows Part A of the work on the ADEME dataset, a collection of 10k energy diagnostics of buildings in France.
It's best to start with Part A to become familiar with the data and its quirks.
The ADEME dataset
ADEME is the French government agency for ecological transition, supporting sustainability, energy efficiency, and environmental innovation. The DPE (diagnostic de performance énergétique, or energy performance diagnostic) dataset is available at https://data.ademe.fr/datasets/dpe-v2-tertiaire-2. This dataset only concerns the tertiary sector: services, administrations, etc.
Each building is labeled with a letter from A to G. Buildings graded A and B show top-level energy efficiency, while buildings labeled F and G are called "passoire énergétique" (energy guzzler).
As is often the case in real-world situations, the dataset is far from perfect.
In part B, we work on a subset of 5.6k samples out of 600k+ samples contained in the complete dataset.
The database so far
At this point, the database ademedb has only one table, called dpe, with 32 columns.
In Part A we removed the rows that were useless, leaving us with 5608 rows.
Your mission is to normalize the database and prepare it for production use.
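Before normalizing, it can help to confirm that your local copy matches this description. The following is a minimal sketch, not part of the original worksheet: the helper names count_dpe_rows and count_dpe_columns are hypothetical, and it assumes that psql can connect to ademedb with your default credentials and that dpe lives in the public schema.

```shell
# Hypothetical helpers; assumes psql connects to ademedb with default credentials.

# Number of rows in dpe (the worksheet expects 5608).
count_dpe_rows() {
  psql -d ademedb -tAc "SELECT count(*) FROM dpe;"
}

# Number of columns in dpe (the worksheet expects 32).
# Assumption: the table lives in the public schema.
count_dpe_columns() {
  psql -d ademedb -tAc "SELECT count(*) FROM information_schema.columns WHERE table_schema = 'public' AND table_name = 'dpe';"
}
```

Running count_dpe_rows and count_dpe_columns in a terminal prints the two numbers. If they differ from 5608 and 32, restoring the dump is probably the simplest fix.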
Load the db
The database should already be loaded, but if needed you can restore it from the dump file available on GitHub. Choose the ademe_backup_01.dump file.
The data can be restored with:
pg_restore --no-owner --no-acl --clean --if-exists --dbname=ademedb ademe_backup_01.dump
If you already have the database available and you've done Part A of the quiz, you don't have to restore the database.
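The "restore only if needed" logic can be sketched as a small shell helper. This is an illustration under assumptions rather than a prescribed step: the function names dpe_table_exists and restore_if_missing are hypothetical, the dpe table is assumed to be in the public schema, and psql and pg_restore are assumed to connect with your default credentials.

```shell
# Hypothetical helpers; assumes default connection settings for psql and pg_restore.

# Succeeds if the dpe table exists in ademedb.
# to_regclass returns NULL (empty output with -tA) when the relation is missing.
# Assumption: the table lives in the public schema.
dpe_table_exists() {
  [ -n "$(psql -d ademedb -tAc "SELECT to_regclass('public.dpe');")" ]
}

# Restore the Part A dump only when the table is not there yet.
restore_if_missing() {
  if dpe_table_exists; then
    echo "dpe table found, skipping restore"
  else
    pg_restore --no-owner --no-acl --clean --if-exists --dbname=ademedb ademe_backup_01.dump
  fi
}
```

Calling restore_if_missing from the directory that contains ademe_backup_01.dump either prints the skip message or runs the restore command given above. The check only tests that the table exists, not that its contents match Part A.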