Introduction
At the beginning of September 2002, ONS announced its intention to hold a consultation on the content of the 2001 SARs. This paper gives details of the consultation. It summarises the reasons why the Census Offices place great importance on protecting the confidentiality of individual information, the analysis that has been completed to assess the disclosure risk of the SARs, and the changes that are likely to be made to the format of the SARs when the 2001 version is released.
The SARs have provided a valuable research dataset over the last 10 years and the Census Offices are keen to produce a similar dataset for 2001. There is, however, a legal obligation to protect the confidentiality of the individual information that is released in the SARs and to ensure that the data released are safe from disclosure risk.
A great deal of analysis has been undertaken to assess the risk of disclosure. To meet our legal requirements, the Census Offices have judged that some changes to the format of the SARs are required. These changes are likely to be in the level of detail that can be released for certain highly visible or disclosive variables. Although these changes are necessary, there is some scope for users to influence where they will be made. For example, it is possible to provide more detailed information for some variables but with less geographic detail.
ONS recognises that these changes will have an impact on the analysis that is carried out on the SARs. It is actively exploring other ways to allow users access to more detailed data. One possibility is a microdata laboratory and this is discussed later on in this document.
Protecting Confidentiality
The Census Offices have a clear, well published, goal for protecting the confidentiality of individual information:
...In releasing statistics from the Census, all possible steps will be taken to prevent the inadvertent disclosure of information about identifiable individuals and households.
The Registrars General also have a legal obligation not to reveal information collected in confidence in the Census about individual people and households, and have given public assurances about what this means in practice. In presenting very detailed results from the Census, protecting individual information is of key importance. Traditionally the confidentiality of Census output is protected by a combination of a variety of 'disclosure control' methods.
As well as the legal aspect of disclosure control ONS has also stated in the 2001 Census Disclosure Control advisory group paper AG0106 that:
"Maintaining the confidentiality of individual data underpins the trust that exists between data suppliers and any agency that acts as custodian of information about them. At ONS we are fortunate that businesses and the public have confidence that their information is securely held and that we do not release any data that could identify an individual. It is essential that this trust be maintained......".
Protecting the confidentiality of details about individual people becomes less simple with each Census, as the amount of accessible and publicly available information about individuals increases. We also know of more information that can be matched statistically with the Census, and electoral rolls are now more widely used in electronic form. Alongside this, for the 2001 Census, we are releasing a larger range of small area statistics, notably because we no longer obtain any key measures from just 10 per cent of the population. We plan to publish much more small area information from all public records over the next three years, with Neighbourhood Statistics.
Since the 1991 Census, the Internet has transformed the potential for making census results widely accessible to citizens. Changing attitudes to the trust in which public agencies are held and concerns about the importance of privacy of personal information also place new and more onerous demands on bodies responsible for protecting such information supplied in confidence.
The general strategy for ensuring the statistical confidentiality of 2001 Census output was stated in the Government's March 1999 White Paper The 2001 Census of Population (Cm 4253):
"Precautions will be taken so that published tabulations and abstracts of statistical data do not reveal any information about identifiable individuals or households. Special precautions may apply particularly to statistical output for small areas. Measures to ensure disclosure control will include some, or all, of the following procedures:
restricting the number of output categories into which a variable may be classified, such as aggregated age groups;
where the number of people or households in an area falls below a minimum threshold, the statistical output - except for basic headcounts - will be amalgamated with that for a sufficiently large enough neighbouring area; and/or
modifying the data before the statistics are released."
All of this has led ONS to reassess how much detail can be released from the 2001 Census. Additional measures have already been introduced for tabular output and further measures will need to be taken for the SARs.
Disclosure Risk Assessment
ESRC, through CCSR, has made a request for 2001 SARs. It has also asked whether ONS would consider the following enhancements to the 1991 SARs specification:
reduce the threshold for the Individual SAR from 120k to 90k
increase the sample size for the Individual SAR from 2% to 3%
changes in detail given to some of the variables for example ethnic group, family type and professional qualifications to reflect changes in the information collected in 2001.
add extra variables (to reflect the new questions asked in the 2001 Census)
These proposals are based on the paper by Dale & Elliot; 'Proposals for the 2001 SARs: an assessment of disclosure risk'. This paper assessed the risk of disclosure from the SARs and concluded that the risk was very low. It suggested that the 1991 assessment of risk was pessimistic and there was scope for a decrease in the threshold and an increase in the sample size of the individual SAR.
ONS has carried out further analysis to assess the risk. In particular, ONS recognised that a risk assessment for the country as a whole would not necessarily allow it to meet the commitments it has made to every individual who completed a Census form. Some individuals are more easily recognisable in the population than others, and the Census Offices have a responsibility to protect everyone's information, not just that of the majority.
ONS also considered how an attempt could be made to identify an individual. It considered what additional information and data would be available to users of the SARs (regardless of whether it was in the public domain) and whether this information could be used to identify an individual in the SARs.
The main elements of the analysis were:
an analysis to determine whether or not a variable should be collapsed (a similar analysis was carried out in 1991; see The 1991 Census User's Guide, edited by Dale & Marsh, HMSO, Chapter 5.4.4)
an analysis of the number and proportion of unique individuals in the sample who are also unique in the population. This looked at the total population as well as groups within it.
a qualitative assessment of the risk that an individual within the SARs can be identified by matching the SARs against an external dataset.
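The second element above, the proportion of sample uniques that are also population uniques, can be illustrated with a minimal sketch. This is not the method ONS used; the function name, the variable names and the six-record population are all made up for illustration, and a real assessment would use the full Census population and a drawn 2% or 3% sample.

```python
from collections import Counter


def key(record, variables):
    """Combination of values an intruder might use to pick out a record."""
    return tuple(record[v] for v in variables)


def share_of_sample_uniques_also_population_unique(population, sample, variables):
    """Among records that are unique in the sample on the given variables,
    return the fraction that are also unique in the population.
    Returns None when the sample contains no unique records."""
    population_counts = Counter(key(r, variables) for r in population)
    sample_counts = Counter(key(r, variables) for r in sample)
    sample_uniques = [r for r in sample if sample_counts[key(r, variables)] == 1]
    if not sample_uniques:
        return None
    also_unique = sum(
        1 for r in sample_uniques if population_counts[key(r, variables)] == 1
    )
    return also_unique / len(sample_uniques)


# Hypothetical records, invented purely for this illustration.
population = [
    {"age": 34, "sex": "F", "occupation": "221"},
    {"age": 34, "sex": "F", "occupation": "221"},
    {"age": 51, "sex": "M", "occupation": "331"},
    {"age": 19, "sex": "M", "occupation": "612"},
    {"age": 19, "sex": "M", "occupation": "612"},
    {"age": 72, "sex": "F", "occupation": "411"},
]
sample = [population[0], population[2], population[3]]

# All three sampled records are unique in the sample, but only the
# 51-year-old is also unique in the population.
share = share_of_sample_uniques_also_population_unique(
    population, sample, ["age", "sex", "occupation"]
)
```

Banding a variable merges keys that were previously distinct, so fewer sample uniques remain unique in the population. This is the effect the banding proposals rely on.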
Main Findings
The following table illustrates some of our results by looking at the probability of a sample unique being a population unique. It looks at how the probability decreases as more variables are banded. (These figures may differ from others that have been made available due to different banding of variables).
Probability of a sample unique being a population unique

Variables used to identify a unique | 2% Sample | 3% Sample
1. CCSR proposed banding: Age (single years to 97, then 97+); Ethnic - 16 categories; Occupation - 3 digit; Sex; Marital Status | 11% | 13%
2. Age banded: Age (single years to 15, 16 to 17, 18 to 21, 22 to 24, 25+ grouped); Ethnic - 16 categories; Occupation - 3 digit; Sex; Marital Status | 8% | 10%
3. Age and Ethnic Group banded: Age (as above); Ethnic - 5 categories; Occupation - 3 digit; Sex; Marital Status | 6% | 8%
4. Age, ethnic group and occupation banded: Age (as above); Ethnic - 5 categories; Occupation - 2 digit; Sex; Marital Status | 4% | 6%
It should be stressed that these figures give an indication of how the risk has been reduced by banding variables. However, there is no threshold below which ONS believes the risk is acceptable, which is why we will also conduct an analysis on the final sample to identify special uniques in the population (see later).
It is a matter of judgement as to what risk is acceptable, and some of the risk is not measurable using probabilities. This is why we have also considered the risk from an intruder's perspective. For instance, it might be argued that the risk of producing a 2% SAR and banding age and ethnic group and keeping the 3 digit occupation is the same as producing a 3% SAR and banding age, ethnic group and occupation. Our judgement is that there is a larger risk associated with having more detailed occupations in the SARs as this information could be used to make a match on an individual.
This analysis showed that grouping of age, ethnic group and occupation substantially reduced the risk of identifying an individual from the sample. It also showed that the sample size could be increased from 2% to 3%.
ONS also looked at the risk of identifying individuals by matching databases against other sources and whether or not some of the variables may be able to help in confirming the identity of individuals. The analysis showed that variables such as the area classification, communal establishment type and family type could all significantly increase the risk by substantially narrowing down the location of an individual or groups of individuals in the population. This analysis suggested that either these variables should not be included in the SARs or that further grouping was necessary.
In coming to its final decision for all variables, ONS has had to use a large degree of judgement, guided where necessary by statistical analysis.
ONS recognises that these levels of grouping will be a disappointment to users. It is therefore proposing:
a. an additional individual SAR where the geographical indicator will be Government Office Region (GOR). This SAR will contain more detailed information than the LAD SAR, as shown in the detailed specifications below.
There will therefore be 2 individual SARs, and users are asked what sample sizes they would prefer for each SAR. The constraints are that the sum of the sample sizes must not exceed 3% and that the LAD SAR cannot exceed 2%. The detailed specification of ONS's proposals is below.
b. exploring the possibility of creating a microdata laboratory (see details later).
Conclusions
The SARs specifications are attached. For most variables it gives the banding used in 1991 (where applicable), the ESRC proposal, the ONS proposal and reasons for any differences. The variables which show differences between proposals are highlighted.
Information is given below for key variables which differ. For ease of reference we refer to three SARs: the LAD SAR as the individual SAR with an LAD indicator, the GOR SAR as the individual SAR with a GOR indicator and the Household SAR.
1. Age to be banded
Age in single years presents a large risk of disclosure in the SARs. Age can be used in combination with other variables to help identify an individual.
Banding age significantly reduces these risks. ONS proposes that age will be grouped as follows for the LAD SAR: single years for ages 0 to 15; the bands 16-17, 18-21 and 22-24; five year age bands from 25 up to age 85; and 85+. Single years of age up to 94 will be available for the GOR SAR. Users are asked to comment on the proposals for the LAD SAR and in particular whether the proposed banding for 16 to 24 year olds is suitable for their needs.
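As an illustration only, the proposed LAD SAR age banding could be expressed as a small recoding function. The function name and the band labels are hypothetical. The sketch also assumes that the five year bands run 25-29, 30-34 and so on up to 80-84, with 85+ as the top band, which is one reading of the proposal rather than a confirmed specification.

```python
def band_age_lad(age):
    """Recode a single year of age into the band proposed for the LAD SAR.

    Assumption: five year bands start at 25 (25-29, ..., 80-84) and
    everyone aged 85 or over falls into a single '85+' band.
    """
    if age < 0:
        raise ValueError("age cannot be negative")
    if age <= 15:
        return str(age)          # single years for ages 0 to 15
    if age <= 17:
        return "16-17"
    if age <= 21:
        return "18-21"
    if age <= 24:
        return "22-24"
    if age >= 85:
        return "85+"
    lower = 25 + 5 * ((age - 25) // 5)   # start of the five year band
    return f"{lower}-{lower + 4}"


# Example: recode a few hypothetical ages.
banded = [band_age_lad(a) for a in (7, 16, 23, 25, 84, 91)]
```

Users considering whether the 16 to 24 bands meet their needs could adapt the cut points in a sketch like this and compare the resulting cell sizes in their own analyses.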
For the household SAR we can provide information on (figures in brackets indicate the household sizes that would be available if only a country identifier was available rather than a GOR identifier):
(1) single years of age for households with up to 5 (6) people
(2) 5 year age bands for household sizes with 6 (7) people or
(3) 10 year age bands for households with 7 (8) people.
Further information may be available for larger households. For example, we may be able to provide household size and ethnic composition.
Users are asked:
(1) whether or not they wish to have the GOR identifier on the household SAR,
(2) to provide information that they would like included for larger households where individual information will not be provided.
2. Ethnic Group to be banded
As with Age, banding Ethnic group will significantly reduce the risk of finding an individual by matching it with other variables (especially occupation). Only 5 ethnic groups can be made available for the individual LAD SAR. A 16-way classification will be available for the GOR SAR. In Scotland a five way classification for LAD and a 14 way classification for the GOR equivalent to those used in the Area Statistics in Scotland will be used. For Northern Ireland there will simply be a two way split between white and non-white.
3. Occupation codes
Traceable and visible occupations were examined. These included occupations such as doctors, police, actors, sports players. Individuals in these occupations are vulnerable to being identified or matched against external databases particularly when presented in conjunction with other variables such as age, ethnic group or geography.
This risk is substantially reduced by restricting the occupational detail to the sub-major (2 digit) codes for the LAD SAR. The minor (3 digit) occupation codes will be provided in the GOR SAR. The household SAR will also include the minor (3 digit) occupations.
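The restriction of occupational detail can be sketched as a truncation of the occupation code, on the assumption that the codes are hierarchical so that the first two digits of a minor (3 digit) code give its sub-major group. The function name, the dictionary and the example code below are hypothetical and are not taken from the SARs specification.

```python
# Number of leading digits of the occupation code retained in each SAR,
# following the proposals described above.
DIGITS_BY_SAR = {"LAD": 2, "GOR": 3, "Household": 3}


def occupation_code_for_sar(code, sar):
    """Truncate an occupation code to the level of detail proposed for a SAR.

    Assumption: codes are digit strings of length 3 or more whose leading
    digits identify the higher-level group.
    """
    code = str(code)
    if not code.isdigit() or len(code) < 3:
        raise ValueError("expected a numeric occupation code of 3 or more digits")
    return code[:DIGITS_BY_SAR[sar]]


# Example with an illustrative code: the LAD SAR keeps two digits,
# while the GOR and Household SARs keep three.
lad_code = occupation_code_for_sar("221", "LAD")
gor_code = occupation_code_for_sar("221", "GOR")
```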
4. Communal Establishment
There are now two communal establishment variables: communal establishment type (17 categories) and management type (6 categories). Providing this information will increase the risk of disclosure. It is necessary to collapse the communal establishment type into 3 categories and the management type into 2, and users are asked for their views on how they would like this split made.
People living in communal establishments will be broken down between residents and non-residents (i.e. staff and managers will be grouped together).
This will apply to both the LAD and GOR SAR.
5. Banding country of Birth
The risk of providing detailed information on country of birth is too great. Eleven categories will be provided in both the LAD and GOR SAR.
6. Pensioner Households
Users asked if they could have both a count of the number of pensioners and the number of over-65s. It is only possible to have one of these options and users are asked to say which one they would prefer. This will apply to both the LAD and GOR SAR.
7. Distance Travelled
Distance moved for migrants and distance travelled to work (or, in Scotland, study) will be banded in the LAD SAR. The additional detail requested by users will be provided in the GOR SAR.
8. Hours
Hours will be top coded at 80 hours for both the LAD and GOR SAR.
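Top coding simply replaces any value above the cap with the cap itself. A minimal sketch follows, in which the function and constant names are hypothetical:

```python
HOURS_TOP_CODE = 80  # cap proposed for both the LAD and GOR SAR


def top_code_hours(hours, cap=HOURS_TOP_CODE):
    """Return hours worked, with any value above the cap recorded as the cap."""
    if hours < 0:
        raise ValueError("hours cannot be negative")
    return min(hours, cap)


# Example: values at or below 80 are unchanged; larger values become 80.
recoded = [top_code_hours(h) for h in (37, 80, 95)]
```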
9. Family type
Same sex couples will not be identified in the LAD or GOR SAR.
10. Professional Qualifications
Information on whether or not an individual holds a professional qualification will be provided. However, we will not be able to provide the type of professional qualification in either SAR.
These measures will reduce the risk that an individual will be able to be identified in the SARs. However, there still remains a risk that some individuals will remain identifiable despite these measures. Therefore, a further analysis will be performed on the data to identify any special uniques that may remain.
Special Uniques analysis
Mark Elliot of Manchester University is currently developing this method. His method will identify individuals whose characteristics are unique in the country as a whole and who are not protected by grouping variables or restricting the level of geographic detail given.
Mark Elliot's method will be applied to the 2001 SARs to identify any of these unique records. Various options will be considered on how to reduce the risk for these records including suppression and recoding.
Other Points that were made in the ESRC proposal
Increase in Sample Size
We will increase the overall size of the individual SAR from 2% to 3%.
Decrease in threshold
There may be a possibility to reduce the threshold from 120K to 90K. Further analysis is underway and we will let users know the outcome when we produce the results of the consultation.
Small Area Microdata, SAM, (a 5% SAR with a reduced set of banded variables)
ONS wishes to produce SAM. Work is underway to assess the disclosure risks and once this is completed we will let users have our proposals. It has also been suggested that we explore what additional information could be provided in the SAM if the population threshold was increased from 15K to 25K or 30K.
Imputed Values
Users have asked if it would be possible to identify records which contain imputed values and also those records which have been imputed by the ONC process. We are assessing what information on imputation will be made available from the Census. This will be considered in the assessment and we will let users know when the assessment has been completed.
Splitting Large LADs
The ESRC asked if it would be possible to split large LADs into smaller areas provided these new areas are larger than the population threshold. We will be prepared to consider this on a case by case basis. This may also be possible for large GORs.
Microdata Laboratory
ONS has begun to look at the feasibility of creating a safe setting or microdata laboratory where more detailed Census microdata can be analysed. This work is still in its early stages and we will keep users informed of its progress. Until the proposals become clearer, GROS cannot give a commitment that Census data for Scotland would be available in this way.
Differences between Countries
This consultation is for England and Wales, Scotland and Northern Ireland. In most cases we are consulting on a similar basis. There are, however, a few exceptions. For ethnic group, Scotland proposes a five way classification for the LAD SAR and a 14 way classification for the GOR SAR, equivalent to those used in the Area Statistics. For Northern Ireland there will simply be a two way split between white and non-white.
Other differences will also occur reflecting the different questions asked for example language (Welsh, Gaelic, Irish) and travel (includes travel to place of study in Scotland).
Initial Feedback from 11 October SARs user group meeting
A meeting was held on 11 October where these proposals were discussed. ONS also provisionally suggested that it may be possible to reduce the population threshold from 120k to 90k. More work is still required on this as it will increase the disclosure risk. We will be able to give a decision at the end of the consultation.
Users gave initial feedback and also expressed disappointment at the proposals. The main points are included below and ONS will consider these:
Users asked ONS to explore the possibility of producing a third individual SAR which had no geographic detail but more detailed information on occupation.
Could large GORs be split into smaller regions for the GOR SAR?
Could further information on households be provided for the household SAR, given that individual information for large households will not be provided?
Given the proposals, there was generally positive feedback for the microdata laboratory. This depended, however, on how much detail will be provided in the SARs held in the laboratory and how and where the data will be provided.
Users suggested that, if we were able to reduce the population threshold, a sensitivity analysis could be carried out to determine at what level the threshold should be set (for example, should it be 95K or 85K).
Areas for Consultation
The list below summarises the areas on which we would like specific feedback. Users are also welcome to provide feedback on these proposals and any other suggestions which they would like us to consider. However, we will only be able to consider suggestions where we consider the risk to the confidentiality of individual information will not be increased.
Split between the LAD and GOR sample sizes
There will be 2 individual SARs and users are asked what sample sizes they would prefer for each SAR. The constraints are that the sum of the sample sizes must not exceed 3% and that the LAD SAR cannot exceed 2%.
At the SARs user group meeting, users asked about the possibility of having a third individual SAR at the national level. This is a possibility; however, the constraint of a maximum sample size of 3% will still exist. ONS is looking at what extra detail could be provided in such a SAR and users are welcome to comment on this.
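Users preparing a response may find it helpful to check a preferred split against the stated constraints. The sketch below is illustrative only: the function name and parameters are hypothetical, sample sizes are expressed as percentages, and the optional national_pct argument stands for the possible third individual SAR, which has not been confirmed.

```python
def check_sample_split(lad_pct, gor_pct, national_pct=0.0, tol=1e-9):
    """Return a list of constraint violations for a proposed split.

    Constraints taken from the consultation text: the LAD SAR cannot
    exceed 2% and the sum of the individual SAR sample sizes must not
    exceed 3%. An empty list means the split satisfies both.
    """
    problems = []
    sizes = {"LAD": lad_pct, "GOR": gor_pct, "national": national_pct}
    for name, value in sizes.items():
        if value < 0:
            problems.append(f"{name} sample size cannot be negative")
    if lad_pct > 2.0 + tol:
        problems.append("LAD SAR exceeds 2%")
    if sum(sizes.values()) > 3.0 + tol:
        problems.append("total sample size exceeds 3%")
    return problems


# Example: a 2% LAD SAR with a 1% GOR SAR meets both constraints,
# whereas a 2.5% LAD SAR with a 1% GOR SAR breaks both.
ok = check_sample_split(2.0, 1.0)
not_ok = check_sample_split(2.5, 1.0)
```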
Age
Users are asked to comment on the proposals for the LAD SAR and in particular whether the proposed banding for 16 to 24 year olds is suitable for their needs.
Household SAR
Users are asked:
(1) whether or not they wish to have the GOR identifier on the household SAR,
(2) to provide information that they would like included for larger households where individual information will not be provided.
Communal Establishments
Advice is sought on how to collapse the communal establishment type into 3 categories and the management type into 2.
Pensioner households
Users are asked whether they would prefer to have all pensioner households identified or all over 65 households identified.
Making Responses
ONS would welcome comments on its proposals and in particular its areas for consultation.