Difference between revisions of "2015 UK General Election Normalising Data"

Jump to: navigation, search
(Implement the tables)
(county has a reference to region)
Line 77: Line 77:
Notice that we take care to indicate NOT NULL on the foreign key - this is because the relationship is not optional. Every county MUST HAVE a region.
*we take care to indicate NOT NULL on the foreign key - this is because the relationship is not optional. Every county MUST HAVE a region.
==constituency references county==
==constituency references county==
And now constituency which references county_name:
And now constituency which references county_name:

Revision as of 22:46, 19 October 2016

The unnormalised data from the CSV file looks like this:


Notice that it includes redundancy - constituency names are repeated for example. We can see that link between the ons_id and the constituency_name is repeated over and over.

Pick a primary key

Our first job is to identify a primary key. There are several options to consider:

  • First we notice that there is no single column that is unique and so we will need to use at least a pair of columns.
  • (firstname,surname) would be good
    • The candidates are unique. By law, no one is allowed to stand as a candidate in two constituencies
    • Unfortunately candidate names are not unique - there are two candidates called "Alan Johnson" for example. You can confirm this with a query such as
      SELECT firstname,surname COUNT(1) FROM ge HAVING COUNT(1)>1
  • The combination (ons_id,party) is also tempting
    • No party will put up two candidates in the same constituency, that would be self defeating and against the rules.
    • Unfortunately there are independent candidates with a NULL party and we cannot have NULL values in the primary key
  • It turns out that the triple (ons_id,firstname,surname) is unique. You may not have more than one candidate in a constituency with the same first name and surname. This would be confusing for voters. We can verify that this is a safe choice with a query such as
    SELECT firstname,surname,ons_id FROM ge GROUP BY firstname,surname,ons_id HAVING COUNT(1)>1;

Identifying dependencies

The columns headins are:

ons_id	ons_region_id	constituency_name	county_name	region_name	country_name	constituency_type	party_name	party_abbreviation	firstname	surname	gender	sitting_mp	former_mp	votes	share	change


Having decided on our primary key as (ons_id, firstname, surname) we notice the following dependencies:

ons_id                   -> ons_region_id
ons_id                   -> consitituency_name
ons_id                   -> county_name
ons_id                   -> constituency_type
county_name              -> ons_region_id
ons_region_id            -> region_name
ons_region_id            -> country_name
party_abbreviation       -> party_name
ons_id,firstname,surname -> gender
ons_id,firstname,surname -> party_abbreviation
ons_id,firstname,surname -> sitting_mp
ons_id,firstname,surname -> former_mp
ons_id,firstname,surname -> votes
ons_id,firstname,surname -> share
ons_id,firstname,surname -> change

Decide on tables

Each distinct determiner (the left hand side of the -> above) will be a table. The determiner will be the primary key in each case.

Bold indicates a primary key, italics indicates a foreign key.

  • constituency(ons_id, constituency, county_name, contituency_type)
  • county(county_name, ons_region_id)
  • region(ons_region_id, region_name, country_name)
  • party(party_id, party_name)
  • candidate(ons_id, firstname, surname, gender, party_id, sitting_mp, former_mp, votes, share, change)

Implement the tables

We need to start with the tables that do not have out-going foreign keys.

party and region are simple tables


  party_id VARCHAR(50) PRIMARY KEY,
  party_name VARCHAR(50)


  ons_region_id VARCHAR(10) PRIMARY KEY,
  region_name VARCHAR(50),
  country_name VARCHAR(50)

county has a reference to region

Now that we have the foreign key target in place we can introduce county which refers to ons_region_id

  county_name VARCHAR(50) PRIMARY KEY,
  ons_region_id VARCHAR(10) NOT NULL,
  FOREIGN KEY (ons_region_id) REFERENCES region(ons_region_id)


  • we take care to indicate NOT NULL on the foreign key - this is because the relationship is not optional. Every county MUST HAVE a region.

constituency references county

And now constituency which references county_name:

CREATE TABLE constituency(
  constituency VARCHAR(50) NOT NULL UNIQUE,
  county_name VARCHAR(10) NOT NULL,
  contituency_type VARCHAR(10)
                   NOT NULL
                   CHECK (constituency_type IN ('county','borough')),
  FOREIGN KEY (county_name) REFERENCES county(county_name)


  • There are exactly two constituency types, roughly county is in the countryside; borough is in the city.
  • constituency names must be unique

candidate is the complicated table

And finally candidate:

CREATE TABLE candidate(
  ons_id VARCHAR(10),
  firstname VARCHAR(50),
  surname VARCHAR(50),
  gender VARCHAR(10) NOT NULL,
  party_id VARCHAR(50) NULL,
  sitting_mp VARCHAR(3) NOT NULL CHECK (sitting_mp IN ('Yes','No')),
  former_mp VARCHAR(3) NOT NULL CHECK (former_mp IN ('Yes','No')),
  votes INT  NOT NULL,
  `change` FLOAT NULL,
  PRIMARY KEY (ons_id,firstname,surname),
  FOREIGN KEY (ons_id) REFERENCES constituency(ons_id),
  FOREIGN KEY (party_id) REFERENCES party(party_id)


  • change is a reserved word and must be enclosed in back-ticks
  • party can be NULL, we will replace the word Independent with NULL
  • change may be NULL

DROP TABLE candidate;DROP TABLE constituency;DROP TABLE county; DROP TABLE region; DROP TABLE party;