2015 UK General Election using SQL Server

From SQLZOO
Revision as of 06:32, 5 July 2017 by Andr3w (talk | contribs) (Attempt import the csv file into your flat table)
Jump to: navigation, search

This tutorial assumes you have access to powershell and sqlcmd or Microsoft SQL Server Management Studio

Get to the powershell prompt and download the csv

Invoke-WebRequest http://researchbriefings.files.parliament.uk/documents/CBP-7186/hocl-ge2015-results-full.csv -OutFile ge2015.csv

Go into sqlcmd or SSMS and create a table for the results

Here you create a single flat table that can store all of the unnormalised data.

The first line of the CSV file contains the column headings (ons_id,ons_region_id,constituency_name,county_name,region_name,country_name,constituency_type,party_name,party_abbreviation,firstname,surname,gender,sitting_mp,former_mp,votes,share,change). We create a table with a column for each of these:

sqlcmd -S .\sqlexpress -E

The create table statement could be:

CREATE DATABASE gisq
GO
USE gisq
GO
CREATE TABLE ge(
  ons_id VARCHAR(10),
  ons_region_id VARCHAR(10),
  constituency_name VARCHAR(50),
  county_name VARCHAR(50),
  region_name VARCHAR(50),
  country_name VARCHAR(50),
  constituency_type VARCHAR(10),
  party_name VARCHAR(50),
  party_abbreviation VARCHAR(50),
  firstname VARCHAR(50),
  surname VARCHAR(50),
  gender VARCHAR(6),
  sitting_mp VARCHAR(3),
  former_mp VARCHAR(3),
  votes INT,
  share FLOAT,
  change VARCHAR(20),
  PRIMARY KEY(ons_id,firstname,surname)
)
GO

Attempt import the csv file into your flat table

You can import the data using this line (from SQLCMD or SQL Server Management Studio)

BULK INSERT ge FROM 'C:\db\ge2015.csv' WITH (FIELDTERMINATOR=',', ROWTERMINATOR='\n', FIRSTROW=2)
GO
  • Each line is ended with carriage return \n
  • The first row contains column headings not data so we start at row 2

You will most likely get error messages like this:

Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3
Bulk load data conversion error (truncation) for row 105, column 13 (sitting_mp).
Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3
Bulk load data conversion error (truncation) for row 106, column 12 (gender).
Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3
Bulk load data conversion error (truncation) for row 107, column 13 (sitting_mp).
...

Unfortunately BULK INSERT cannot deal with the CSV format so we need to preprocess the file. Some solutions:

  • Use Excel - you can load CSV and save as TXT
  • Use regular expressions: convert CSV to TXT
  • Use python
  • Find a converter online

Identifying the problem rows

You can look at the data with a text editor or use powershell. We load all the raw text into $ge1 - we plan to fix the data stage by stage.

$ge = get-content "$pwd\ge2015.csv"
$ge[104]

Which shows

S14000006,S92000003,"Ayr, Carrick and Cumnock",Scotland,Scotland,Scotland,County,Scottish National Party,SNP,Corri,Wilson,Female,No,No,25492,0.488268306,0.307935794

(The error was reported on 105 but because we skipped row 1 the actual first problem is on 104) Observe that the constituency name has a comma in it - BULK INSERT is interpreting this as a field delimiter.

Fixing the problem rows

The best way to fix this is to use someone else's code to deal with CSV - but if we are careful you can do this yourself with regular expressions. Later versions of SQL Server allow text delimiters to be defined, python has a csv module.

You can use replace - we can replace the , inside a double quote using a regular expression:

$ge[104] -replace ',"(.*),(.*)"', ',"$1|$2"'

shows this

S14000006,S92000003,"Ayr| Carrick and Cumnock",Scotland,Scotland,Scotland,County,Scottish National Party,SNP,Corri,Wilson,Female,No,No,25492,0.488268306,0.307935794

You can apply this to the whole file using

$ge = $ge -replace ',"(.[^"]),(.[^"])"', ',"$1|$2"'

Then write the file out to ge2.csv

$ge > "$pwd\ge2.csv"

You can try loading this into the table ge again - but take care to delete any rows you created last time.

sqlcmd -S .\sqlexpress -E -d gisq -Q "BULK INSERT ge FROM '$pwd\ge2.csv' WITH (FIELDTERMINATOR=',', ROWTERMINATOR='\n', FIRSTROW=2)"

More problems

You will probably find further problems:

  • there is a constituency with two commas in it
  • your data still includes double quote marks that should be replaced
  • you have | marks in your data that should be replaced with commas

You can deal with these problems either in powershell, a capable text editor, excel or in SQL

Here is a solution in powershell:

$ge1 = get-content "$pwd\ge2015.csv"
$ge1 = $ge1 -replace ',"([^"]*),([^"]*)"', ',"$1|$2"'
$ge1 = $ge1 -replace ',"([^"]*),([^"]*)"', ',"$1|$2"'
$ge1 = $ge1 -replace '"', ""
$ge1 = $ge1 -replace ",", "`t"
$ge1 > "$pwd\ge2.csv"
sqlcmd -S .\sqlexpress -E -d gisq -Q "DELETE FROM ge"
sqlcmd -S .\sqlexpress -E -d gisq -Q "BULK INSERT ge FROM '$pwd\ge2.csv' WITH (FIELDTERMINATOR='\t', ROWTERMINATOR='\n', FIRSTROW=2)"

Run some queries

Now let's look at some data. How many female candidates were there?

select count(1) from ge where gender='female';

Who stood in Edinburgh South?

select surname,votes from ge where constituency_name='Edinburgh South';

You can now move to the next stage: 2015 UK General Election Normalising Data