Area classification for statistical wards - selection of variables
Introduction
The purpose of this paper is to explain the choice of variables for analysis in the ward level 2001 Area Classifications. The selection process is the same as that used for the Local Authority classification, and the variables initially selected turned out to be similar, with some minor changes. The underlying objective in variable choice is to select the minimum number of variables that will adequately represent the main dimensions in the census data. For presentation purposes we have defined these as demographic structure; household composition; housing; socio-economic character and employment. The data are from the 2001 Key Statistics tables produced from the Census and are available for wards at 2003 boundaries. It should be noted that wards with a population of fewer than 1000 were combined with neighbours to ensure that all wards had a minimum of 1000 people. These are known as statistical wards and have been defined in this way for census outputs.
The steps involved in the selection of variables are summarised below:
Variables from the Census Key Statistics tables were considered for use.
Some variables were merged to create composite variables, for example, the variable 'Indian' represents people identifying as Indian, Pakistani or Bangladeshi.
Strongly correlated variables were removed by examining the correlation matrix. If a pair of variables had a Pearson correlation coefficient above 0.82 or below -0.82 then one of the pair was considered for exclusion. It was necessary to remove strongly correlated variables so that the aspect of census data that they represent did not have too much influence on the results.
Variables with problematic distributions (e.g. a high proportion of zeros) were not included.
In all cases, the decision to include or exclude a variable also involved using our own judgement. Continuity with previous classifications was considered when deciding whether to include or exclude a variable. Consultations regarding the selection of variables were carried out with the Area Classification Advisory Board. The Advisory Board proposed carrying out a principal component analysis to aid variable selection. This was tried but did not contribute anything further.
Initial set of variables considered
The Census Key Statistics had been identified by users as being the most important variables, so the initial data set included all variables from the Ward Level Key Statistics Tables. Some variables from the Census Standard tables were also considered but were inaccessible and difficult to use. Given that users had already identified the Key Statistics variables as being the most important, those from the Standard tables were rejected. All variables suggested by the Advisory Board were considered, as were all those that are included in the Local Authority Classification.
Reducing the initial set of variables
As the aim was to represent the main dimensions in the census data with the minimum number of variables, the initial data set was reduced. If a variable did not add anything to the classification it was removed. For example, the variables "age 65+" and "age 75+" capture many of the same population characteristics. "Age 65+" was selected to represent those above the retirement age; it is also based on larger numbers, so is more reliable at ward level. In some cases a composite variable was used to reduce the number of variables; for example, Indian sub-continent ethnicity has been used to represent respondents identifying with each of the individual countries. Variables that identified very small sectors of the population were removed. It was not possible to include a migration indicator as these data had yet to be released for England and Wales. Religion was also omitted because the question was optional and there were many missing values. All variables used in the Local Authority classification were included in the initial set of variables considered.
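As an illustration of how a composite variable of this kind might be derived, the sketch below sums the counts for the merged groups and expresses the total as a percentage of the ward population. The function name, the dictionary keys and the counts are all invented for the example and are not taken from the Key Statistics tables.

```python
def composite_percentage(counts, population):
    """Percentage of the population falling in any of the merged groups."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 100.0 * sum(counts) / population


# Hypothetical ward: every figure here is invented for illustration only.
ward = {"population": 5000, "indian": 150, "pakistani": 75, "bangladeshi": 25}

# Composite 'indian' variable: Indian + Pakistani + Bangladeshi residents.
indian = composite_percentage(
    [ward["indian"], ward["pakistani"], ward["bangladeshi"]],
    ward["population"],
)
print(indian)  # 5.0
```

Summing the counts before dividing keeps the composite on the same percentage scale as the other variables.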
Further reducing the variable set
A further reduction was made, as it is likely that some of these variables represent the same population characteristic and will therefore not provide extra information. For example, a high proportion of people at pensionable age might account for a high proportion of single pensioner households and a high rate of rooms per person. It was therefore necessary to further reduce this set of variables by identifying and removing strongly correlated variables. The Pearson correlation coefficient was used to identify those pairs of variables where it is likely that a characteristic would be given too much weight if both were to be included in the Classification. If a pair of variables had a correlation coefficient that was above 0.82 or below -0.82 then one of the pair was considered for exclusion. Pairs of correlated variables were not just considered in isolation but alongside other pairs of correlated variables. Judgements about which variables to exclude were made following group discussion, using all available evidence. The distribution of each variable was also examined and any variables with problematic distributions (e.g. a high proportion of zeros) were not included.
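The correlation screen described above can be sketched as follows. This is a minimal illustration, not the procedure actually used: it only flags pairs whose Pearson coefficient exceeds 0.82 in absolute value, leaving the choice of which variable to drop to judgement, as in the text. The ward values are invented, and "age75ov" is a hypothetical label for the "age 75+" variable.

```python
from itertools import combinations
from math import sqrt

THRESHOLD = 0.82  # cut-off quoted in the text


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def flag_correlated_pairs(data, threshold=THRESHOLD):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for a, b in combinations(sorted(data), 2):
        r = pearson(data[a], data[b])
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged


# Invented values for five hypothetical wards.
wards = {
    "age65ov": [10.0, 15.0, 20.0, 25.0, 30.0],
    "age75ov": [4.0, 7.0, 9.0, 12.0, 15.0],  # hypothetical label
    "popden": [30.0, 5.0, 40.0, 10.0, 25.0],
}

pairs = flag_correlated_pairs(wards)
for name_a, name_b, r in pairs:
    print(name_a, name_b, "flagged for review")
```

With these invented figures only the "age65ov" and "age75ov" pair is flagged, which mirrors the example in the text where one of the two age variables was dropped.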
A note concerning the included and excluded variables
We consider that the chosen variables represent the main dimensions of the census. Since an Urban/Rural indicator was not available we used Population density as a proxy Urban/Rural indicator. A distinction was made between different age groups in the population. It was suggested previously that those aged over 65 should be represented so this variable was included. The variable "age 15-24" was not included as this is highly correlated with the variable "students", and "students" were thought to be a more distinct group than the whole of the 15-24 age group. The over-75s variable is highly correlated with pensioners so it was excluded. It was decided not to include the proportion of people who live in a communal establishment as there are a lot of wards with a zero value for this variable, and some wards with very high proportions, e.g. student residences. Communal establishments are very heterogeneous, including care homes, hostels, prisons, university residences etc.
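A simple way to check for the kind of distribution described for communal establishments is to measure the share of wards with a zero value. The sketch below does this for invented data. The paper does not state a numeric cut-off, so the 0.5 limit is purely an assumption for illustration.

```python
def zero_share(values):
    """Fraction of wards in which the variable equals zero."""
    return sum(1 for v in values if v == 0) / len(values)


MAX_ZERO_SHARE = 0.5  # assumed cut-off, not taken from the paper

# Invented percentages of residents in communal establishments, ten wards.
communal = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.2, 0.0, 45.0, 0.0]

share = zero_share(communal)
problematic = share > MAX_ZERO_SHARE
print(share, problematic)  # 0.8 True
```

A variable flagged in this way would still be reviewed by judgement rather than dropped automatically.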
The Indian and Black variables were included to represent ethnicity. The Chinese variable was not used as the Chinese population is small and fairly evenly distributed, so this would not contribute much to an area classification. In the initial selection we had included the proportion of people not born in the European Union and the proportion of people born in the European Union excluding the UK. But as membership of the European Union will change over time it was decided that it would be better to use the proportion of people not born in the UK instead. The only living arrangement variable that appears to be measuring a different group of people is the separated/divorced subgroup and so this was included. "Widows not living in a couple" was not included as this is highly correlated with single pensioners. It was suggested that a variable to represent adult children living with parents should be included so this variable was added. "Households with children" is highly correlated with "age 5-14". Keeping "Households with children" would mean removing the age variables "age 0-4" and "age 5-14" as these would already be covered, but "age 0-4" and "age 5-14" would represent different types of families.
There are four household composition variables that identify useful groups in the population. Renters from both the private and public sector were included plus terraced and detached houses. Purpose built flats and converted flats were excluded because they are correlated with all flats. Also there were a lot of wards that did not contain any converted flats. "Semi-detached" does not represent any distinct groups so was left out. We did not keep the second residence variable as this was not an actual question on the census form.
We did not need to include both average number of rooms and household size, so only the latter was used. The average number of people per room was used to measure overcrowding. The overcrowding variable, occrate, was not used as it only measures overcrowding, whereas people per room represents people who have many rooms as well as those who have too few.
A "long-term unemployed" variable and "people with routine occupations" were included to represent different aspects of social exclusion. "Households with no adults in employment with dependent children" is highly correlated with "households containing one adult with dependent children", so this variable was not used. Conversely, "people with H.E. qualifications" and "households with two or more cars" were included to represent the section of the population that is generally more prosperous. "People without qualifications" was highly correlated with those "working in a routine occupation" so this variable was not kept. "People in professional or managerial occupations" was highly correlated with "people with H.E. qualifications" so this variable was not kept. "Average number of cars" and "two or more cars" are highly correlated. We decided to keep the variable "two or more cars" as a measure of affluence. "Households with no cars" is highly negatively correlated with "households that have two or more cars" so it was not necessary to include this variable as well.
The working from home variable may identify interesting sub-groups, as would those using public transport. People with limiting long-term illness represent poor health. The percentage of residents who provide unpaid care identifies an important group of the population. The separate unemployment rates for men and women are highly correlated with total unemployment. There is also a high correlation between the unemployment rate for men and the unemployment rate for women, so total unemployment was used instead of the two separate unemployment rates. We included men and women who work part-time separately and, since a composite unemployment rate is used, those who work full-time are not needed. Those who have never worked will be covered by the unemployment variable. Other employment variables are women who look after the home, students and the percentage of the population that work in some industrial subgroups. The disabled/sick variable is correlated with limiting long-term illness and only applies to people of working age. Those who are retired are already accounted for in the pensioner group, so the retired variable is not used.
List of final set of variables
The variables are listed under six domains. A short variable 'label' precedes each variable description.
Demographic
Age
age04 - percentage of resident population aged 0-4
age514 - percentage of resident population aged 5-14
age2544 - percentage of resident population aged 25-44
age4564 - percentage of resident population aged 45-64
age65ov - percentage of resident population aged 65 and over
Ethnicity
indian - percentage of people identifying as Indian, Pakistani or Bangladeshi
black - percentage of people identifying as Black African, Black Caribbean or Other Black (1)
Country of Birth
notuk - percentage of people not born in the UK
Population
popden - number of people per hectare
Household Composition
Living Arrangements
sepdiv - percentage of residents over 16 who are not living in a couple and are separated or divorced (2)
Size/Family
singper - percentage of households with one person who is not a pensioner
singpen - percentage of households which are single pensioner households
oneadk - percentage of households which are lone parent households with dependent children
twoank - percentage of households which are cohabiting or married couple households with no children
adultchi - percentage of households comprising one family and no others with non-dependent children living with their parents
Housing
Tenure
public - percentage of households that are public sector rented accommodation
private - percentage of households that are private/other rented accommodation
Type and size
terraced - percentage of all household spaces which are terraced
detached - percentage of all household spaces which are detached
allflats - percentage of all household spaces which are purpose built, converted and communal building flats
Quality/crowding
nocent - percentage of occupied household spaces without central heating
avhsize - average household size
peroom - average number of people per room
Socio-Economic
Education
hequal - percentage of people aged 16-74 with a higher education qualification
Socio-economic class
routocc - percentage of people aged 16-74 in employment working in routine or semi-routine occupations
Ownership/commuting
twoplcar - percentage of households with 2 or more cars
pubtr - percentage of people aged 16-74 in employment who usually travel to work by public transport (3)
workfrhm - percentage of people aged 16-74 in employment who work mainly from home
Health and Care
wrklli - percentage of working age population with limiting long term illness (8)
provun - percentage of people who provide unpaid care (7)
Employment
students - percentage of people aged 16-74 who are students (4)
unemp - percentage of economically active people aged 16-74 who are unemployed
ltunem - percentage of the unemployed who are long-term unemployed (5)
menpt - percentage of economically active men aged 16-74 who work part time (6)
womlah - percentage of economically inactive women aged 16-74 who are looking after the home
wompt - percentage of economically active women aged 16-74 who work part time (6)
Industry Sector
agricult - percentage of all people aged 16-74 in employment working in agriculture and fishing
minquarr - percentage of all people aged 16-74 in employment working in mining, quarrying and construction
manufac - percentage of all people aged 16-74 in employment working in manufacturing
hotelcat - percentage of all people aged 16-74 in employment working in hotel and catering
healthso - percentage of all people aged 16-74 in employment working in health and social work
finan - percentage of all people aged 16-74 in employment working in financial intermediation
whole - percentage of all people aged 16-74 in employment working in wholesale/retail trade
Footnotes
(1) includes Scottish Black for Scottish Unitary Authorities
(2) from KS03
(3) for Scottish Unitary Authorities this is the percentage of residents who usually travel to work or place of study by public transport
(4) This includes economically active full time students and economically inactive students
(5) Unemployed since 1999 or earlier
(6) Part-time is defined as working less than 30 hours a week
(7) Provides at least one hour a week of unpaid care
(8) working age is 16-64 for men and 16-59 for women