Project objective
Users of Census data asked the Census offices to provide output which is complete and consistent. They did not wish to fill in gaps in tables containing 'not known' responses by having to make their own estimates for missing values. The pattern of non-responses is often different from that of reported data, and without access to the individual records users would not be able to correct for such non-response bias or accurately estimate values for derived variables which were based on more than one item. There would also be a danger that different users would make estimates for the complete population in different ways, creating inconsistencies between their results. An Edit and Imputation strategy was therefore put in place with the aim of estimating for all missing data and resolving inconsistencies in responses for the people and households affected.
For the 2001 Census we aimed to follow these principles:
- All changes that were made would improve the quality of the data
- The number of changes to inconsistent data would be kept to a minimum
- As far as possible missing data would be imputed for all variables, so as to provide a complete and consistent database
- The system had to be relatively easy to develop and capable of processing large amounts of data automatically within short timescales
Background
For the 1991 Census, edit matrices had been developed for easy-to-code items to provide valid values to deal with inconsistencies. The most appropriate value was used, based on comparison with completed forms containing similar combinations.
Items could sometimes not be imputed in this way, because further inconsistencies arose. In these cases, and where there were missing values, an imputation system was put in place using a set of 50 tables which contained valid values for individual items. The tables were given an initial set of values based on earlier results, and as processing advanced they were updated with new values from wholly correct records. The most recent value, referenced by other characteristics of the person or household, was copied into the record requiring imputation. This is known as the 'hot deck' method.
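The hot deck idea described above can be illustrated with a minimal Python sketch. The field names (age_group, sex, tenure), the single table and its starting values are hypothetical; the contents of the 50 tables used in 1991 are not given in this report. As a simplification, any record holding a value for the item refreshes the table, whereas the 1991 system used only wholly correct records.

```python
# Sequential hot deck: a table keyed by other characteristics holds the most
# recent valid value seen; a missing item is filled by copying that value.
# Field names and starting values are illustrative assumptions only.

def hot_deck_impute(records, item, key_fields, initial_deck):
    deck = dict(initial_deck)            # seeded from earlier results
    out = []
    for rec in records:
        rec = dict(rec)                  # leave the caller's data untouched
        key = tuple(rec[f] for f in key_fields)
        if rec.get(item) is None:
            rec[item] = deck[key]        # copy the most recent stored value
            rec["imputed"] = True
        else:
            deck[key] = rec[item]        # a reported value refreshes the deck
            rec["imputed"] = False
        out.append(rec)
    return out

records = [
    {"age_group": "25-34", "sex": "F", "tenure": "rented"},
    {"age_group": "25-34", "sex": "F", "tenure": None},
    {"age_group": "65+", "sex": "M", "tenure": None},
]
# The sketch assumes the initial deck holds a value for every key combination.
initial = {("25-34", "F"): "owned", ("65+", "M"): "owned"}
result = hot_deck_impute(records, "tenure", ["age_group", "sex"], initial)
```

In this example the second record takes the value reported by the first, while the third falls back on the initial value because no earlier record shared its characteristics.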
In developing a system for 2001, extensive tests were carried out on the use of neural networks, which can detect complex relationships in data without the need for complex modelling techniques. However, in testing the neural networks failed to produce imputed values which were consistent with the edit rules, and they did not perform as well as the 1991 hot deck method.
A donor edit and imputation system was also trialled, whereby all missing or inconsistent values on a record would be adopted from another similar individual (the donor). However, it was decided that setting specific values in the editing routines rather than basing them on similar donors would be more operationally efficient (although possibly less statistically accurate). Values would be set to missing for imputation if Edit could not resolve an inconsistency. Thus an Edit and Donor Imputation System (EDIS) was devised for the 2001 Census. An extensive programme of specification and development of EDIS was undertaken, which will be evaluated as part of the report on Downstream Processing.
EDIS was applied to individual records once the data had been loaded by ONS. It was designed to fill in almost all the gaps in records for existing people and households. A person was taken to exist if at least two of the name, date of birth and sex fields were completed. A process was applied prior to EDIS to remove any duplicate records, where someone had entered their data more than once or more than one form had been received for the same household.
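The existence rule stated above is simple enough to write down directly. The sketch below is illustrative only: the field names are hypothetical, and a blank string is assumed to count as not completed.

```python
def person_exists(record):
    """True if at least two of name, date of birth and sex are completed."""
    fields = ("name", "date_of_birth", "sex")   # hypothetical field names
    completed = sum(1 for f in fields if record.get(f) not in (None, ""))
    return completed >= 2

example_ok = person_exists({"name": "A N Other", "date_of_birth": None, "sex": "F"})
example_not = person_exists({"name": "", "date_of_birth": None, "sex": "M"})
```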
The One Number Census process imputed for whole households and people who were missed from the Census. EDIS modified values for individuals on returned census forms.
Methodology
EDIS can be sub-divided into five elements:
Multi-tick rules dealt with cases where more than one box was ticked but only one option was allowed. In some cases there was a rule for selecting one tick. If more than half the boxes were ticked or a set of priorities for accepting one tick could not sensibly be made, the answer was treated as missing, and a value was supplied at the imputation stage.
Range checks were applied to prevent answers being outside an acceptable range. Out-of-range answers were set to missing for subsequent imputation. Examples were households with 0 or more than 99 rooms, or with more than 20 cars, and people with a date of birth before 1891 or after Census Day, who last worked before 1941 or who worked more than 99 hours per week.
Filter rules were applied to resolve some inconsistencies and to decide which fields should be set to 'No Code Required' where questions were answered but should not have been. For example, people under 16 or over 74 were not required to answer any of the employment questions. The variable Activity Last Week was also derived at this stage.
A set of Edit rules was applied to missing items or responses which appeared to be in error or inconsistent when compared with other data (such as married couples of the same sex, a child less than 13 years younger than its parents, or a married person under 16). These are known as hard checks.
In determining how to resolve such inconsistencies, the Fellegi/Holt principle of making the minimum number of changes was followed as far as possible. Thus if a person was under 16, married and had answered employment questions such as occupation, Age would be set to missing, since the inconsistency could be resolved with the least change by imputing a value for Age between 16 and 74.
Edit also identified unlikely, but not impossible responses. In some cases rules were applied to eliminate these: for example, a purpose-built flat was considered unlikely to have more than 10 rooms, and for reasons explained below the value was set to 'Missing' for imputation. In others no further action was taken, eg where people under 35 were retired from paid work. The number of these 'soft checks' was reported but the data were not changed as a result.
All items which were missing after the Edit stage were dealt with by the Imputation component, which is described below.
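The minimum-change principle described above can be sketched as a small covering problem: find the fewest fields which, between them, are involved in every failed edit. The Python below is a simplified illustration and not the EDIS implementation. The two rules and the field names are assumptions based on the example in the text, and the full Fellegi/Holt method also derives implied edits, which this sketch omits.

```python
from itertools import combinations

# Each edit lists the fields it involves and a test that is True when the
# record FAILS the edit. Rules and field names are illustrative assumptions.
EDITS = [
    (("age", "marital_status"),
     lambda r: r["age"] < 16 and r["marital_status"] == "married"),
    (("age", "occupation"),
     lambda r: r["age"] < 16 and r["occupation"] is not None),
]

def fields_to_blank(record, edits=EDITS):
    """Return a smallest set of fields that touches every failed edit."""
    failed = [set(fields) for fields, fails in edits if fails(record)]
    if not failed:
        return set()
    candidates = sorted(set().union(*failed))
    # Try sets of increasing size, so the first hit is a minimum-size set.
    for size in range(1, len(candidates) + 1):
        for combo in combinations(candidates, size):
            if all(set(combo) & f for f in failed):
                return set(combo)
    return set(candidates)

record = {"age": 14, "marital_status": "married", "occupation": "plumber"}
to_blank = fields_to_blank(record)
```

For the record shown, both failed edits involve Age, so blanking Age alone resolves them, which matches the outcome described in the text.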
Imputation was applied when there was no answer on the Census form, it failed the multi-tick rules or was invalid, or the filter rules or Edit marked it for imputation to resolve an inconsistency.
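The way the multi-tick, range and filter stages feed items into imputation can be sketched as follows. This is a minimal illustration under stated assumptions: the marker values, function names and field names are hypothetical, the priority list is an invented example, and the employment questions are taken to apply to people aged 16 to 74.

```python
MISSING = None
NO_CODE_REQUIRED = "NCR"   # hypothetical marker value

def resolve_multi_tick(ticks, n_boxes, priority=None):
    """Return one answer, or MISSING if the ticks cannot be resolved."""
    if len(ticks) == 1:
        return ticks[0]
    # No tick, more than half the boxes ticked, or no sensible priority order
    if not ticks or len(ticks) > n_boxes / 2 or priority is None:
        return MISSING
    return min(ticks, key=priority.index)   # highest-priority ticked box

def range_check(value, low, high):
    """Set an out-of-range value to MISSING for later imputation."""
    if value is MISSING or not (low <= value <= high):
        return MISSING
    return value

def apply_filter(record):
    """Employment answers from people outside 16 to 74 need no code."""
    rec = dict(record)
    age = rec.get("age")
    if age is not MISSING and not (16 <= age <= 74):
        for field in ("occupation", "hours_worked"):
            rec[field] = NO_CODE_REQUIRED
    return rec

def needs_imputation(record):
    """List the fields left as MISSING after the earlier stages."""
    return [f for f, v in record.items() if v is MISSING]

household = {"rooms": range_check(0, 1, 99), "cars": range_check(2, 0, 20)}
to_impute = needs_imputation(household)
```

Here a household reporting 0 rooms has that item set to missing, so rooms is the only field passed on for imputation.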
The principle of a Donor Imputation System is to search for a single donor household to supply all the missing variables in a recipient household. Exceptions are imputation for postcode of usual address one year ago and of workplace, which were carried out at a later stage than imputation for other variables.
The search looked at all records in an Estimation Area, a group of contiguous Local Authority Districts of about 500,000 population. The method searched for a donor using up to five matching variables, which were determined by the fields requiring imputation on the recipient record. Values were copied over from the donor household to fill the missing values on the recipient record. Consistency checks were then applied and the donor rejected if any check failed.
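The search, copy and check cycle described above might be sketched as below. This is an illustration rather than the EDIS algorithm: the matching variables, field names and the single consistency check (no married person under 16, taken from the hard checks listed earlier) are assumptions chosen to keep the example small.

```python
def find_donor(recipient, candidates, match_vars, checks):
    """Return (completed record, donor) for the first acceptable donor."""
    missing = [f for f, v in recipient.items() if v is None]
    for donor in candidates:
        # A donor must hold a value for every field to be imputed
        if any(donor.get(f) is None for f in missing):
            continue
        # The donor must agree with the recipient on the matching variables
        if any(donor[v] != recipient[v] for v in match_vars):
            continue
        trial = dict(recipient)
        for f in missing:
            trial[f] = donor[f]          # copy the donor's values across
        if all(check(trial) for check in checks):
            return trial, donor
        # otherwise the donor is rejected and the search continues
    return None, None

def no_married_child(r):
    return not (r["age"] < 16 and r["marital_status"] == "married")

recipient = {"age": 15, "sex": "F", "student": True, "marital_status": None}
candidates = [
    {"age": 19, "sex": "F", "student": True, "marital_status": "married"},
    {"age": 14, "sex": "F", "student": True, "marital_status": "single"},
]
filled, donor = find_donor(recipient, candidates, ("sex", "student"), [no_married_child])
```

The first candidate matches but would make the recipient a married 15 year old, so it is rejected and the second candidate supplies the value.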
Potential donor households were scored using a second set of matching variables relating to all people in the household. In addition, potential donors were penalised if they had been used before as a donor or if any of their fields had been edited or imputed. A record could not be used as a donor if any of the fields to be imputed were also missing on the donor. If potential donors still scored equally, the donor geographically closest to the recipient was chosen. However, to improve efficiency of the searching procedure, if a suitable donor was found who lived within 5,000 metres of the recipient, this person was accepted and no further search took place to find a closer donor.
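The scoring and selection rules above can be sketched as follows. The size of each penalty is not given in this report, so a penalty of one point per previous use and one point for an edited or imputed donor is assumed purely for illustration; the coordinate field xy (in metres) and the other field names are likewise hypothetical.

```python
import math

def choose_donor(recipient, donors, score_vars, near_m=5000):
    """Pick the best-scoring donor, using distance to break ties."""
    def score(d):
        s = sum(d[v] == recipient[v] for v in score_vars)
        s -= d.get("times_used", 0)            # assumed penalty per earlier use
        s -= 1 if d.get("edited") else 0       # assumed penalty if edited/imputed
        return s

    if not donors:
        return None
    top = max(score(d) for d in donors)
    nearest, nearest_dist = None, math.inf
    for d in donors:
        if score(d) != top:
            continue
        dist = math.dist(d["xy"], recipient["xy"])
        if dist <= near_m:
            return d                           # close enough: stop searching
        if dist < nearest_dist:
            nearest, nearest_dist = d, dist
    return nearest

recipient = {"size": 3, "tenure": "owned", "xy": (0, 0)}
donors = [
    {"id": "A", "size": 3, "tenure": "owned", "xy": (9000, 0)},
    {"id": "B", "size": 3, "tenure": "owned", "xy": (3000, 4000)},
    {"id": "C", "size": 3, "tenure": "owned", "xy": (100, 0)},
    {"id": "D", "size": 3, "tenure": "owned", "xy": (10, 0), "times_used": 1},
]
chosen = choose_donor(recipient, donors, ("size", "tenure"))
```

Donor B is accepted because it is the first top-scoring donor found within 5,000 metres, even though C is closer. This mirrors the trade-off described in the text between search efficiency and finding the nearest possible donor.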
The intention was to use joint imputation where possible, ie selecting a single donor household to impute for all the people with missing values in a recipient household so as to preserve the joint distributions between variables. If a suitable donor household could not be found for joint imputation, separate donors were sought to provide values for each person in the household requiring imputation, if necessary reducing the number of matching variables.
A fallback stage was also required, as donor imputation failed to work for a few people. Most of these were imputed by testing possible values at random until one was found which met the consistency criteria (a 'cold deck' approach). A small number of households could still not be completely resolved because of inconsistencies in age and relationships between people. As a final stage ('son of fallback'), if all else failed, households containing up to eight people were completely replaced by synthetic households drawn at random from a set of the same household size, and households of nine or more people were corrected clerically.
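The fallback logic can be sketched in a few lines. The sketch assumes a single missing field, a hypothetical list of possible values and an arbitrary limit on the number of attempts; none of these details is specified in this report.

```python
import random

def cold_deck_impute(record, field, possible_values, checks, rng, max_tries=1000):
    """Test random values until one satisfies every consistency check."""
    for _ in range(max_tries):
        trial = dict(record)
        trial[field] = rng.choice(possible_values)
        if all(check(trial) for check in checks):
            return trial
    return None    # unresolved: left for the final stage

def final_stage(household_size):
    """Route an unresolved household according to its size."""
    if household_size <= 8:
        return "replace with synthetic household of the same size"
    return "clerical correction"

def married_implies_adult(r):
    return not (r["marital_status"] == "married" and r["age"] < 16)

rng = random.Random(2001)          # fixed seed so the example is repeatable
person = {"age": None, "marital_status": "married"}
resolved = cold_deck_impute(person, "age", range(0, 111), [married_implies_adult], rng)
```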
The aim of imputation was to estimate the distribution of missing values accurately, so as to take account of any differences between the characteristics of respondents and non-respondents (non-response bias). It was not expected that the imputed values for every individual would be precisely accurate.
In comparison with 1991, EDIS was more comprehensive. It was applied to all variables, including qualifications, relationships, occupation, industry, hours worked, workplace address and means of transport to work, which were only analysed for a 10% sample of households and communal establishments in 1991.
There was some manual intervention in the 1991 processing system, such as clerical checking of missing or inconsistent items which exceeded certain tolerances. EDIS was almost entirely automatic as clerical intervention was limited to households of more than eight people which failed the fallback stage.
Implementation
The EDIS system performed fully to specification, and ran well within the planned timetable. Imputation, which accounted for the bulk of the running time, was typically completed in 2 days for each of the 101 Estimation Areas in England and Wales.
In the analysis set out below, 'non-response' relates to failure to answer a question adequately, either because no response was supplied, or a value was out of range, inadequately described (in the case of occupation, industry or ethnic group), or multi-ticked. During Edit, some of these non-responses were set to a specific value. Edit also identified inconsistencies which led to a response being marked for imputation. The imputation rate may therefore be higher or lower than the rate of non-response.
People and households imputed by the One Number Census process are not included in these analyses.
Edit
A total of 13.7 million (m) edits were carried out on the data for 11.8m people. The base population for EDIS was 49.4m people in England and Wales, including some 0.6m students living away from home during term-time for whom only a few demographic and relationship questions applied at their home address. The eight most frequently executed edits accounted for 91% of the total. These were:
4.50m  Professional qualifications set to None where missing but educational qualifications was answered
2.29m  Carer set to No where missing unless Activity Last Week was also missing
1.66m  Workplace size set to 1-9 where person was self-employed
1.08m  Travel to work set to "work mainly at/from home" where workplace address was "mainly work at/from home"
1.03m  Supervisor set to No if missing, unless occupation was also missing
1.01m  Health set to Good if missing, unless Activity Last Week was also missing
0.59m  Professional qualifications set to missing if answered but educational qualifications was missing
0.40m  Missing Country of birth set to that of either siblings, parents or other related people in the household who have the same Country of birth
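Three of the edits listed above share one pattern: a missing item is set to a default value unless a companion item is also missing. A table-driven sketch of that pattern is given below. The field names are hypothetical and the code is an illustration of the rule structure, not the EDIS edit routines.

```python
# (field, default value, companion field that must not itself be missing)
DEFAULT_EDITS = [
    ("carer", "No", "activity_last_week"),
    ("supervisor", "No", "occupation"),
    ("health", "Good", "activity_last_week"),
]

def apply_default_edits(person):
    """Apply the default-value edits and report which fields were changed."""
    p = dict(person)
    applied = []
    for field, default, companion in DEFAULT_EDITS:
        if p.get(field) is None and p.get(companion) is not None:
            p[field] = default
            applied.append(field)
    return p, applied

person = {"carer": None, "supervisor": None, "health": None,
          "activity_last_week": "working", "occupation": None}
edited, applied = apply_default_edits(person)
```

In this example Carer and Health are set to their defaults, while Supervisor stays missing for imputation because occupation is also missing.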
Imputation
One or more items needed to be imputed for 13.8m people - that is 28.0% of the population who returned Census forms. Of these, 4.7m were dealt with by joint imputation. 10.0m were imputed using individual imputation, including all those in single person households. 9.8m of the individual imputed cases used a donor household of the same size as the recipient's and the remaining 0.2m a household of different size. 0.4m people required imputation using the cold deck fallback method. Over 1m people had some items imputed by one method and some by another, hence there is some double-counting.
23.4% of the population were used once as donors, 2.1% twice and 0.1% three or more times.
For household variables, 2.5m needed imputation, 11% of all households. 0.08m were dealt with by fallback and the remainder by joint imputation. Almost all the donor households for joint imputation were used once each.
                                Total (including  Non-response  Imputed  Non-response  Imputed
Person variables                  imputed), 000s          000s     000s             %        %
Age                                       49,359           262      278          0.53     0.56
Sex                                       49,359           199      219          0.40     0.44
Marital status                            49,359           372      158          0.76     0.32
Student flag                              49,359           622      641          1.26     1.30
Country of birth                          48,848         1,211      829          2.48     1.70
Ethnic group                              48,848         1,405    1,421          2.88     2.91
Welsh language                             2,754           153      153          5.54     5.57
Religion                                  48,848         3,721        -          7.62        -
Health                                    48,848         1,525      531          3.12     1.09
Carer                                     48,848         2,967      693          6.07     1.42
Long-term illness                         48,848         1,899    1,915          3.89     3.92
Address one year ago                      48,848         2,198    2,213          4.50     4.53
Educational qualifications                35,367         2,187        -          6.18        -
Professional qualifications               35,367         6,094        -         17.23        -
Highest qualification                     35,367             -    2,150             -     6.09
Working last week                         35,367           737        -          2.08        -
Activity last week                        35,367             -    1,301             -     3.69
Employment status                         33,686         2,205    2,058          6.55     6.14
Workplace size                            33,686         4,689    3,067         13.92     9.15
Supervisor                                33,686         2,294    1,119          6.81     3.34
Occupation - currently working            21,741           694      759          3.19     3.48
Occupation - all ever worked              29,335         4,051    4,051         13.81    13.81
Industry - currently working              21,741         1,702    1,777          7.83     8.15
Industry - all ever worked                29,335         5,400    5,400         18.41    18.41
Workplace address                         22,396         1,744    1,426          7.79     6.42
Method of travel                          22,533         1,410    1,127          6.26     5.07
Hours worked                              22,533         1,804    1,506          8.00     6.77
Relationship to Person 1                  28,065           971    1,326          3.46     4.73
Note:
The 'Total' column refers to the number of people in scope for the question, ie:
- Age, Sex, Marital status, Student flag: All people plus students who were counted at both their home address and term-time address in England and Wales
- Country of birth, Ethnic group, Religion, Health, Carer, Long-term illness, Address one year ago: All people (students counted at term-time address only)
- Welsh language: All people living in Wales
- Qualifications, Working last week: All people aged between 16 and 74
- Employment status, Workplace size, Supervisor: All people aged between 16 and 74 who had ever worked
- Workplace address, Method of travel, Hours worked: All people aged between 16 and 74 who were working in the week before Census day
- Relationship to Person 1: All people in households plus students also counted at home address less those who were entered as Person 1 on census form
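The percentage columns of the person variables table follow directly from the counts and the in-scope totals. As a worked check, the short Python sketch below recomputes them for three rows copied from the table; the function and variable names are illustrative only.

```python
# (total in scope, non-response, imputed), all in thousands, from the table
rows = {
    "Age": (49359, 262, 278),
    "Health": (48848, 1525, 531),
    "Carer": (48848, 2967, 693),
}

def rates(total, non_response, imputed):
    """Non-response and imputation rates as percentages of those in scope."""
    return (round(100 * non_response / total, 2),
            round(100 * imputed / total, 2))

computed = {name: rates(*values) for name, values in rows.items()}
```

For Health and Carer the imputation rate is well below the non-response rate because an edit rule supplied a value for many of the missing items. For Age it is slightly higher, because Edit marked some reported values for imputation.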
Age
Age was not reported or was out of range (born after Census day or more than 110 years old) for 240,000 people. It was set to missing for a further 23,000 on grounds of inconsistency, mainly because people who were not single and who had answered three or more employment questions had their age captured as under 16.
                     Imputed    Total (including imputed)
Age group        000s      %            000s            %
0 - 4              15    5.6           2,800          5.7
5 - 9              12    4.4           3,066          6.2
10 - 14            10    3.7           3,233          6.5
15 - 19            26    9.3           3,143          6.4
20 - 24            23    8.4           3,032          6.1
25 - 29            21    7.4           3,061          6.2
30 - 34            22    8.0           3,655          7.4
35 - 39            20    7.3           3,819          7.7
40 - 44            19    7.0           3,461          7.0
45 - 49            18    6.3           3,157          6.4
50 - 54            21    7.6           3,470          7.0
55 - 59            16    5.8           2,876          5.8
60 - 64            16    5.8           2,476          5.0
65 - 69            12    4.3           2,238          4.5
70 - 74            11    4.0           2,029          4.1
75 - 79             6    2.0           1,717          3.5
80 - 84             4    1.5           1,146          2.3
85 & over           4    1.6             984          2.0
The distribution of imputed ages followed that of the remainder of the population except for a shortfall among the 0, 6-15 and 76-80 age groups. This is primarily because some people who had answered employment questions were imputed as aged between 16 and 74, although their true age may have been outside this range. The shortfall in babies under 1 year old occurred where their address one year ago had not been stated as 'no usual address'. The effect in an area of 100,000 population would typically be that 2 or 3 children under 1 would have been imputed as over 1.
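The shortfall can be seen by comparing the two percentage columns of the age table. The sketch below uses the published five-year bands, which are coarser than the single years of age quoted in the text, and flags bands where the imputed share falls more than one percentage point below the population share. The one-point threshold is an arbitrary choice made for this illustration.

```python
# (age group, imputed %, total %) copied from the age table
age_shares = [
    ("0 - 4", 5.6, 5.7), ("5 - 9", 4.4, 6.2), ("10 - 14", 3.7, 6.5),
    ("15 - 19", 9.3, 6.4), ("20 - 24", 8.4, 6.1), ("25 - 29", 7.4, 6.2),
    ("30 - 34", 8.0, 7.4), ("35 - 39", 7.3, 7.7), ("40 - 44", 7.0, 7.0),
    ("45 - 49", 6.3, 6.4), ("50 - 54", 7.6, 7.0), ("55 - 59", 5.8, 5.8),
    ("60 - 64", 5.8, 5.0), ("65 - 69", 4.3, 4.5), ("70 - 74", 4.0, 4.1),
    ("75 - 79", 2.0, 3.5), ("80 - 84", 1.5, 2.3), ("85 & over", 1.6, 2.0),
]

THRESHOLD = 1.0   # percentage points; an assumption for this example

shortfall = [group for group, imputed, total in age_shares
             if total - imputed > THRESHOLD]
```

With this threshold the flagged bands are 5 - 9, 10 - 14 and 75 - 79, which is consistent with the shortfall at ages 6-15 and 76-80 noted above.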
Sex
Sex was missing for 185,000 people and multi-ticked for 14,000, 0.4% of the population in total. There were no edit actions which directly affected this question: if a husband and wife, or the parents of a child, were of the same sex the relevant relationships were imputed. A further 20,000 had values imputed by 'son of fallback'.
                     Imputed    Total (including imputed)
                 000s      %            000s            %
Female            113   51.7          25,473         51.6
Male              106   48.3          23,887         48.4
The sexes were imputed in the ratio of 51:49 in favour of females, very similar to the proportions among the remainder of the population. The accuracy of imputations was assessed by comparing the imputed values with people's names in a sample of areas. This showed that 75% of imputations were correct. Among the incorrect values there was a very slight bias towards imputing females. The net effect would be to count four people out of every 100,000 as female rather than male.
Marital Status
There were 373,000 missing or multi-ticked cases for marital status, representing 0.8% of the population. 232,000 of these were children under 16 who were set to Single in edit. A further 6,000 under 16s had marital status changed to Single. Imputation was applied to the remainder. Married and Re-married were less likely to be imputed than among the remainder of the population.
Student
Question 5 on the person schedule asked whether a person was a schoolchild or student in full-time education. 1.3% of people failed to answer or multi-ticked the question, of whom 13% were imputed as students compared with 21% in the remainder of the population.
Country of Birth
Country of birth was omitted by 2.5% of people. Of these, 88% were imputed as born in the United Kingdom, compared to 92% in the remainder of the population. People born in Africa, Asia and North America were imputed in higher proportion than the remainder of the population.
Ethnic Group
The non-response rate for ethnic group was 2.9%. 89% of these were imputed as White compared with 92% in the remainder of the population. There were higher proportions of imputed people in the Mixed, Asian and Black groups.
Welsh Language
The question asking whether people could understand spoken Welsh, or speak, read or write the language, was asked of all people living in Wales. There was a 5.5% non-response rate. No knowledge of Welsh was imputed slightly more often than for the remainder of the population.
Religion
As the question on religion was voluntary, non-responses were not imputed but will appear in tables as 'not stated'. The national non-response rate was 7.6%.
General Health
This question asked whether over the last twelve months a person's health had on the whole been good, fairly good or not good. The non-response rate was 3.1%, but an edit rule set the value to good unless Activity Last Week was also missing. This reduced the number requiring imputation to 1.1%. Among these people, Fairly Good and Not Good were imputed slightly more frequently than in the remainder of the population.
Carer
Question 12 referred to voluntary help or support given to family members, friends or neighbours. The rate of non-response was 6.1%. Missing values were set to No by an edit rule unless Activity Last Week was also missing, and children under 5 were also assumed not to be providing care. Of the remaining 1.3% of the population, 11% were imputed as Carers in comparison to 10% among the remainder of the population.
Long-term Illness
There was a 3.9% non-response rate to this question, which asked about any long-term illness, health problem or disability which limited the person's daily activities or the work they could do. 22% of these were imputed as having such a condition in comparison with 18% among the remainder of the population.
Address One Year Ago
This question had a non-response rate of 4.5%. No usual address was imputed more often than among the remainder of the population, mainly because there was a high rate of non-response for children under 1.
Qualifications
This topic was covered by two questions, on educational and professional qualifications, which had non-response rates of 6.2% and 17.2% respectively. Where professional qualifications was missing, it was set to None by an edit rule if educational qualifications was answered. Professional qualifications was set to missing if educational qualifications was not answered. Taking the responses to the two questions together, a new variable called highest qualification was derived. After applying the edit rules, 6.1% of people needed to have highest qualification imputed. People with imputed values were more likely to have no qualifications (Level 0) than the remainder of the population.
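The interaction between the two edit rules and the derived variable can be sketched as below. The actual derivation of highest qualification is not set out in this report, so the sketch assumes that each answer has already been coded to a hypothetical integer level (0 meaning no qualifications) and that the derived value is simply the higher of the two.

```python
MISSING = None

def edit_qualifications(educ, prof):
    """Apply the two edit rules; levels are hypothetical integers, 0 = none."""
    if educ is MISSING:
        return MISSING, MISSING   # professional is set to missing as well
    if prof is MISSING:
        prof = 0                  # professional is set to None (level 0)
    return educ, prof

def highest_qualification(educ, prof):
    """Derive the highest level, or MISSING if it must be imputed."""
    educ, prof = edit_qualifications(educ, prof)
    if educ is MISSING:
        return MISSING            # left for the imputation stage
    return max(educ, prof)        # assumed derivation: higher of the two

derived = [
    highest_qualification(2, MISSING),   # professional defaulted to none
    highest_qualification(MISSING, 4),   # both treated as missing
    highest_qualification(1, 4),
]
```

Under these assumptions only people who did not answer the educational question need highest qualification imputed, which is why the imputation rate (6.1%) is close to the educational non-response rate (6.2%) rather than the professional one (17.2%).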