Instrument Testing and Validation

Introduction:

Epidemiology is the study of the health status of populations. The populations studied can be human or animal, and the health status can include disease or its determinants. Regardless of the subject, data collection methods vary. Generally, however, data are collected using written, standardized instruments designed to increase comparability between study groups (1). Having identified the researchable problem, the population and the variables to be measured, one must choose the procedure for collecting data and the measuring instrument to be used. The instrument should be chosen with forethought about the type of statistical analyses required once the data are collected (2).

Types of Instruments:

Instruments are simply devices for measuring the variables of interest. They can take the form of record abstracts, questionnaires, physical examinations, bio-specimen collection or environmental samples (1). They can also take the form of observational schedules, structured logbooks or standard forms for recording data from existing records (2). Questionnaires are the mainstay for epidemiologists, though the other methods can be just as useful; each has its strengths and limitations.

Record abstracts are used to extract information from written records kept for another purpose. Depending on the complexity of the original documents, the forms can be difficult to design. They should use closed-ended questions with predefined responses from which to select, so as to reduce the need for interpretation (1).

Physical examinations require more exact recording of findings than in a strict clinical setting. Subjects with no abnormal findings must be fully examined and described to ensure comparability within the study. Clinicians must be trained and adequate quality control utilized to ensure minimal variation. Forms must be extensively pre-tested to ensure ease of use during examination (1).

Laboratory components of epidemiologic studies require collecting biologic specimens from study participants. Problems of this type of instrument include laboratory error, reproducibility of the assay and storage of samples if testing is to be performed at a later date (1).

Environmental samples of soil, water and air, along with measurements of ionizing radiation and other environmental variables, are collected and compared or combined with the determinants under investigation (1).

Questionnaires:

The questionnaire is the most common instrument used in epidemiologic studies. Simply defined, a questionnaire is a standardized list of factual questions or elicited opinions (5). A questionnaire has three basic components: content, the form of the question and the level of data collected (4). Every word in a question can influence the validity and reliability of the responses. The meanings of words must be clear to all respondents, and survey questions should be short and direct (10). The objective is to construct questions that are simple and unambiguous and that encourage honest and accurate responses (4).

Questions in a survey may be in an open-ended or closed-ended format. With open-ended questions, respondents answer in their own words; with closed-ended questions, respondents must choose from pre-selected answers (4). Closed-ended formats provide the same frame of reference for all respondents to use in determining their answers (10). Open-ended formats allow freedom to answer questions without any limitations imposed by the researcher (4).

Data can be collected by questionnaire either by interview or by self-administration. Self-administered questionnaires are often more effective and consistent, while interviews allow the interviewer to clarify questions and elicit complete and logical responses (4). The self-administered questionnaire is less of a social encounter than interview methods and can be posted to people to minimize social desirability and interviewer bias (2). Advantages of self-administered questionnaires include lower cost, standardization and anonymity. Interviews may be conducted either in person or via telephone. Interview advantages include clarity, richness, universality and control (4).

The responses to questionnaires can be placed on a nominal, ordinal, interval or ratio scale. Nominal is the weakest measurement level, with numbers or other symbols used to simply categorize or sort a characteristic or item. An ordinal scale is one in which the variable is classified into ordered qualitative categories or put in a ranking order. An interval scale is characterized by an equal ordering of items and establishes equal intervals on a continuum (5). A ratio scale is similar to an interval scale except the ratio scale has a true zero point as its origin (3). Numeric and alphabetic codes transform answers into variables that can be tabulated and analyzed statistically.
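As an illustration of how numeric codes turn answers into variables that can be tabulated, the following is a minimal Python sketch. The question topics, answer categories and code values are hypothetical and are not taken from any instrument discussed here.

```python
from collections import Counter

# Hypothetical codebooks (invented for illustration only).
# Nominal: the numbers are labels and carry no order.
NOMINAL_SMOKING = {"never": 0, "former": 1, "current": 2}
# Ordinal: the numbers preserve the rank order of the categories.
ORDINAL_HEALTH = {"poor": 1, "fair": 2, "good": 3, "excellent": 4}


def code_response(answer, codebook):
    """Return the numeric code for an answer, or None if missing or unrecognized."""
    if answer is None:
        return None
    return codebook.get(answer.strip().lower())


def tabulate(answers, codebook):
    """Count how many respondents fall into each coded category."""
    return Counter(code_response(a, codebook) for a in answers)


# Hypothetical raw answers, including a missing value and an invalid entry.
answers = ["Good", "fair", "good ", None, "Excellent", "unsure"]
counts = tabulate(answers, ORDINAL_HEALTH)
print(counts)
```

Missing and unrecognized answers are both coded as None in this sketch; a real codebook would usually give them separate codes so that they can be distinguished during analysis.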

Essentially, the instrument should be simple and easy to read, questions should be comfortable to read aloud, clear instructions and examples should be provided and the format should be uncluttered (4). Assurances of confidentiality, a brief introduction of the study and a thank you statement should all be included on the questionnaire (2).

Reliability and Validity:

The adequacy of a measuring instrument is determined by its reliability and validity (5). Two fundamental questions should be asked when selecting a measuring instrument. First, does the instrument measure a variable consistently? And second, is the instrument a true measure of the variable? The first is an indication of reliability while the second raises the issues of validity (5). Psychometric validation is the process by which an instrument is assessed for reliability and validity through the mounting of a series of defined tests on the population group for whom the instrument is intended (2).

Reliability refers to the reproducibility and consistency of the instrument, and the degree to which it is free from random error (2). Several criteria should be assessed before an instrument can be judged reliable: test-retest reliability, inter-rater reliability and internal consistency. Test-retest reliability is assessed by administering the instrument to the same population on at least two occasions and correlating the results (3). Inter-rater reliability is the extent to which results obtained by two or more raters agree for the same population (2). Internal consistency is the concordance between two variables that measure the same general characteristic (4). Cronbach’s alpha is an estimate of internal consistency based on all possible correlations between the items within the scale. Values usually range from 0 to 1, and there is no agreement on the minimum acceptable standard. A reliability coefficient of 0.70 implies that 70% of the variance in the measured variable is reliable and 30% is owing to random error. A low coefficient suggests that the items do not all belong to the same conceptual domain (2).
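Two of the reliability measures described above can be sketched in a few lines of Python. This is a minimal illustration, not a validated implementation: it assumes the commonly used variance form of Cronbach's alpha, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), and a Pearson correlation for test-retest agreement. The function names and the sample scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    Uses sample variances throughout; requires k >= 2 and at least two respondents.
    """
    k = len(scores[0])
    items = list(zip(*scores))                      # one tuple of scores per item
    item_var_sum = sum(variance(item) for item in items)
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores.

    Raises ZeroDivisionError if either list has no variation.
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical data: five respondents answering a four-item scale.
item_scores = [
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [4, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
]
print("alpha:", round(cronbach_alpha(item_scores), 2))

# Hypothetical total scores for the same respondents on two occasions.
first_administration = [14, 9, 18, 6, 13]
second_administration = [13, 10, 17, 7, 13]
print("test-retest r:", round(pearson_r(first_administration, second_administration), 2))
```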

Validity is an assessment of whether an instrument measures what it aims to measure. The assessment should cover face, content, criterion, construct (both convergent and discriminant) and predictive validity (2). Evaluation of validity involves assessment against a standard criterion. Face validity refers to a subjective assessment of the presentation and relevance of the questionnaire (2). Content validity is the extent to which the content of the instrument appears to logically examine and comprehensively include the characteristic it is intended to measure (2). Criterion validity is assessed by correlating the measure with another measure that is accepted as the ‘gold standard’. Convergent validity requires that the instrument correlate with related variables, while discriminant validity requires that the instrument not correlate with disparate variables (2). To be deemed valid, an instrument must also be precise, that is, able to detect small changes in a characteristic (4). The instrument must also be responsive to changes within the population and the individual over time. Responsiveness is related to sensitivity and specificity. Sensitivity refers to the proportion of actual cases who test or score positive by the instrument, while specificity refers to the proportion of subjects who are not cases and who test or score negative by the instrument (2).
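The definitions of sensitivity and specificity given above translate directly into a small calculation. The Python sketch below assumes each subject has an instrument result and a 'gold standard' classification recorded as True or False; the function names and the sample data are hypothetical.

```python
def two_by_two(instrument_positive, truly_case):
    """Cross-classify instrument results against the gold standard.

    Returns (true positives, false positives, false negatives, true negatives).
    """
    tp = fp = fn = tn = 0
    for test, truth in zip(instrument_positive, truly_case):
        if test and truth:
            tp += 1
        elif test and not truth:
            fp += 1
        elif not test and truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """Proportion of actual cases who score positive on the instrument."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Proportion of non-cases who score negative on the instrument."""
    return tn / (tn + fp)


# Hypothetical results for six subjects.
instrument = [True, True, False, False, True, False]
gold_standard = [True, False, True, False, True, False]

tp, fp, fn, tn = two_by_two(instrument, gold_standard)
print("sensitivity:", sensitivity(tp, fn))
print("specificity:", specificity(tn, fp))
```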

Pilot Studies:

Once the development is complete, the instrument must be evaluated in a pre-test or pilot study. A pilot study is a preliminary, small-scale study performed to test all aspects of an instrument before proceeding with the actual study (11). Initial pre-tests improve the clarity of questions and instructions and should include a small number of subjects who represent the range of potential respondents in the study. Larger pre-tests are useful to perfect the range, reliability, efficiency and statistical characteristics of the instrument (4). Due to the great expense, time, and effort required to construct an instrument or questionnaire, new scales are often adaptations of existing scales (2).
 
 

Instrument Examples:

The Physical Activity Scale for the Elderly (PASE) is a 32-item questionnaire that examines twelve self-reported occupational, household and leisure activities. When PASE was tested in elderly patients in a rural community following a cardiac event, the instrument was found to be reliable, though its validity was questionable. The instrument was chosen because it includes activities and terminology familiar to the elderly, it is available in large print for ease of reading, its items are sufficiently broad to apply to those living in a rural community, and it uses a self-report format (6).

The Strengths and Difficulties Questionnaire (SDQ) is a brief behavioral questionnaire that examines twenty-five attributes both positive and negative. There are two forms of the questionnaire, an informant-rated version for teachers and parents and a self-report version for those between the ages of 11 and 16. It was determined that although this instrument could be used to examine group differences, it could not be used to accurately diagnose individuals. The SDQ was found to be quite useful in the assessment of a young person’s degree of awareness of their own problems (7).

The Telephone Cognitive Assessment Battery (TCAB) is an instrument designed to be administered over the telephone to assess the cognitive status of older individuals. In an evaluation of the validity of the TCAB it was determined that the instrument successfully discriminated between mildly cognitively impaired persons and healthy normal subjects in a population of older participants in much the same way as in-person assessment (8).

The London Fibromyalgia Epidemiology Study Screening Questionnaire (LFESSQ) is an instrument to screen for fibromyalgia in general population surveys. The instrument tested pain criteria and fatigue criteria, using 4 and 2 questions, respectively. The four questions on pain were selected in accordance with the distribution of pain required by the American College of Rheumatology criteria. Meeting fatigue criteria required a "yes" response to both fatigue items. In a random survey of the general population the questionnaire was found to have a high positive predictive value (9).
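The fatigue rule and the idea of positive predictive value can be illustrated with a short Python sketch. Only the "yes to both fatigue items" rule comes from the description above; the pain scoring rule is not specified here and is therefore left out. The function names and the counts in the example are hypothetical and are not results from the study.

```python
def meets_fatigue_criteria(fatigue_answers):
    """True only if both fatigue items were answered 'yes'."""
    return len(fatigue_answers) == 2 and all(
        answer.strip().lower() == "yes" for answer in fatigue_answers
    )


def positive_predictive_value(true_positives, false_positives):
    """Proportion of screen-positive subjects who are confirmed as cases."""
    return true_positives / (true_positives + false_positives)


print(meets_fatigue_criteria(["yes", "Yes"]))   # both items answered yes
print(meets_fatigue_criteria(["yes", "no"]))    # only one item answered yes

# Hypothetical counts: 8 confirmed cases among 10 screen-positive subjects.
print(positive_predictive_value(true_positives=8, false_positives=2))
```

Unlike sensitivity and specificity, positive predictive value depends on how common the condition is in the population surveyed, so a value obtained in one population may not carry over to another.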

The Eating Disorder Examination-Self-report Questionnaire (EDE-Q) is a 41-item instrument adapted from a structured clinical interview assessing the key behavioral features and associated psychopathology of eating disorders. The four sub-scales of the EDE-Q (restraint, weight concern, shape concern and eating concern) all demonstrate excellent internal consistency and test-retest reliability. There was, however, less stability in items measuring the occurrence and frequency of the key behavioral features of eating disorders. Overall, the EDE-Q was determined to be a psychometrically sound self-report measure for the screening of eating disorders (12).

Cataract surgery produces changes in quality of life, so patient self-assessment of visual function and health-related quality of life are the most appropriate outcome measurements. The TyPE is a self-assessment questionnaire that examines visual function. In a study of cataract surgery patients, the TyPE was found to be valuable in measuring the effectiveness of that surgery as correlated with visual function, showing good test-retest reliability and construct validity (13).

Links to Instruments:

Choose your instrument

Quality of Life Compendium

Toolkit of Instruments

WHO Quality of Life Instruments

References

  1. Rothman KJ, Greenland S. Modern Epidemiology. Philadelphia: Lippincott-Raven, 1998.
  2. Bowling A. Research Methods in Health. Philadelphia: Open University Press, 1998.
  3. Bowling A. Measuring Health. Philadelphia: Open University Press, 1997.
  4. Hulley SB, Cummings SR. Designing Clinical Research. Baltimore: Williams & Wilkins, 1988.
  5. Stein F. Anatomy of Clinical Research. New Jersey: SLACK Incorporated, 1989.
  6. Allison MJ, Keller C, Hutchinson PL. Selection of an instrument to measure the physical activity of elderly people in rural areas. Rehabilitation Nursing. 1998; 23(6): 309-314.
  7. Goodman R, Meltzer H, Bailey V. The Strengths and Difficulties Questionnaire: a pilot study on the validity of the self-report version. European Child & Adolescent Psychiatry. 1998; 7: 125-130.
  8. Debanne SM, Patterson MB, Dick R, Riedel TM, Schnell A, Rowland DY. Validation of a telephone cognitive assessment battery. Journal of the American Geriatrics Society. 1997; 45(11): 1352-1359.
  9. White KP, Harth M, Speechley M, Ostbye T. Testing an instrument to screen for fibromyalgia syndrome in general population studies: the London Fibromyalgia Epidemiology Study Screening Questionnaire. The Journal of Rheumatology. 1999; 26(4): 880-884.
  10. Weisberg HF, Krosnick JA, Bowen BD. An Introduction to Survey Research, Polling and Data Analysis. Thousand Oaks: SAGE Publications, 1996.
  11. Toma B, ed. Dictionary of Veterinary Epidemiology. Iowa: Iowa State University Press, 1999.
  12. Luce K, Crowther JH. The reliability of the Eating Disorder Examination: self-report questionnaire version (EDE-Q). International Journal of Eating Disorders. 1999; 25: 349-351.
  13. Lawrence DJ, Brogan C, Benjamin L, Pickard D, Stewart-Brown S. Measuring the effectiveness of cataract surgery: the reliability and validity of a visual function outcomes instrument. British Journal of Ophthalmology. 1999; 83: 66-70.