Difference between revisions of "2015 UK General Election Normalising Data"

From SQLZOO
Jump to: navigation, search
(candidate is the complicated table)
(Add the party data)
Line 148: Line 148:
 
   GROUP BY party_abbreviation;
 
   GROUP BY party_abbreviation;
 
</pre>
 
</pre>
 
+
==INSERT county data==
Similarly
+
<pre>
 +
INSERT INTO county
 +
  SELECT DISTINCT county_name,ons_region_id
 +
    FROM ge;
 +
</pre>
 +
==INSERT constinituencies==
 +
INSERT INTO constituency
 +
  SELECT DISTINCT ons_id, constituency_name, county_name, constituency_type
 +
    FROM ge;
 +
</pre>

Revision as of 23:09, 19 October 2016

The unnormalised data from the CSV file looks like this:

Ge2015Data.png

Notice that it includes redundancy - constituency names are repeated for example. We can see that link between the ons_id and the constituency_name is repeated over and over.

Pick a primary key

Our first job is to identify a primary key. There are several options to consider:

  • First we notice that there is no single column that is unique and so we will need to use at least a pair of columns.
  • (firstname,surname) would be good
    • The candidates are unique. By law, no one is allowed to stand as a candidate in two constituencies
    • Unfortunately candidate names are not unique - there are two candidates called "Alan Johnson" for example. You can confirm this with a query such as
      SELECT firstname,surname COUNT(1) FROM ge HAVING COUNT(1)>1
  • The combination (ons_id,party) is also tempting
    • No party will put up two candidates in the same constituency, that would be self defeating and against the rules.
    • Unfortunately there are independent candidates with a NULL party and we cannot have NULL values in the primary key
  • It turns out that the triple (ons_id,firstname,surname) is unique. You may not have more than one candidate in a constituency with the same first name and surname. This would be confusing for voters. We can verify that this is a safe choice with a query such as
    SELECT firstname,surname,ons_id FROM ge GROUP BY firstname,surname,ons_id HAVING COUNT(1)>1;

Identifying dependencies

The columns headins are:

ons_id	ons_region_id	constituency_name	county_name	region_name	country_name	constituency_type	party_name	party_abbreviation	firstname	surname	gender	sitting_mp	former_mp	votes	share	change

Ge2015Data.png

Having decided on our primary key as (ons_id, firstname, surname) we notice the following dependencies:

ons_id                   -> ons_region_id
ons_id                   -> consitituency_name
ons_id                   -> county_name
ons_id                   -> constituency_type
county_name              -> ons_region_id
ons_region_id            -> region_name
ons_region_id            -> country_name
party_abbreviation       -> party_name
ons_id,firstname,surname -> gender
ons_id,firstname,surname -> party_abbreviation
ons_id,firstname,surname -> sitting_mp
ons_id,firstname,surname -> former_mp
ons_id,firstname,surname -> votes
ons_id,firstname,surname -> share
ons_id,firstname,surname -> change

Decide on tables

Each distinct determiner (the left hand side of the -> above) will be a table. The determiner will be the primary key in each case.

Bold indicates a primary key, italics indicates a foreign key.

  • constituency(ons_id, constituency, county_name, contituency_type)
  • county(county_name, ons_region_id)
  • region(ons_region_id, region_name, country_name)
  • party(party_id, party_name)
  • candidate(ons_id, firstname, surname, gender, party_id, sitting_mp, former_mp, votes, share, change)

The Entity Relationship Diagram for this database is: Ge2015.png

Implement the tables

We need to start with the tables that do not have out-going foreign keys.

party and region are simple tables

Party:

CREATE TABLE party(
  party_id VARCHAR(50) PRIMARY KEY,
  party_name VARCHAR(50)
);

Region:

CREATE TABLE region(
  ons_region_id VARCHAR(10) PRIMARY KEY,
  region_name VARCHAR(50),
  country_name VARCHAR(50)
);

county has a reference to region

Now that we have the foreign key target in place we can introduce county which refers to ons_region_id

CREATE TABLE county(
  county_name VARCHAR(50) PRIMARY KEY,
  ons_region_id VARCHAR(10) NOT NULL,
  FOREIGN KEY (ons_region_id) REFERENCES region(ons_region_id)
);

Notice

  • we take care to indicate NOT NULL on the foreign key - this is because the relationship is not optional. Every county MUST HAVE a region.

constituency references county

And now constituency which references county_name:

CREATE TABLE constituency(
  ons_id VARCHAR(10) PRIMARY KEY,
  constituency VARCHAR(50) NOT NULL UNIQUE,
  county_name VARCHAR(10) NOT NULL,
  contituency_type VARCHAR(10)
                   NOT NULL
                   CHECK (constituency_type IN ('county','borough')),
  FOREIGN KEY (county_name) REFERENCES county(county_name)
);

Notice:

  • There are exactly two constituency types, roughly county is in the countryside; borough is in the city.
  • constituency names must be unique

candidate is the complicated table

And finally candidate:

CREATE TABLE candidate(
  ons_id VARCHAR(10),
  firstname VARCHAR(50),
  surname VARCHAR(50),
  gender VARCHAR(10) NOT NULL,
  party_id VARCHAR(50) NULL,
  sitting_mp VARCHAR(3) NOT NULL CHECK (sitting_mp IN ('Yes','No')),
  former_mp VARCHAR(3) NOT NULL CHECK (former_mp IN ('Yes','No')),
  votes INT  NOT NULL,
  share FLOAT NOT NULL,
  `change` FLOAT NULL,
  PRIMARY KEY (ons_id,firstname,surname),
  FOREIGN KEY (ons_id) REFERENCES constituency(ons_id),
  FOREIGN KEY (party_id) REFERENCES party(party_id)
);

Note

  • change is a reserved word and must be enclosed in back-ticks
  • party can be NULL, we will replace the word Independent with NULL
  • change may be NULL

DROP TABLES

If you need to start again you must drop the tables in the reverse order. Last created is the first dropped.

DON'T DROP THE TABLES!

DROP TABLE candidate;
DROP TABLE constituency;
DROP TABLE county;
DROP TABLE region;
DROP TABLE party;

Add the region data

INSERT INTO region
  SELECT DISTINCT ons_region_id, region_name, country_name
    FROM ge;

Add the party data

Annoyingly the abbreviation Lab has been used for two different parties (they are the same party really - they just don't admit it).

That means we cannot use the DISTINCT method. Using MIN will arbitrarily assign the first value (alphabetically) to the pesky "Labour and Co-operative" party members.

INSERT INTO party
  SELECT party_abbreviation, MIN(party_name)
    FROM ge
   GROUP BY party_abbreviation;

INSERT county data

INSERT INTO county
  SELECT DISTINCT county_name,ons_region_id
    FROM ge;

INSERT constinituencies

INSERT INTO constituency

 SELECT DISTINCT ons_id, constituency_name, county_name, constituency_type
   FROM ge;

</pre>