For ethical and economic reasons, it is important to design animal experiments well, to analyze the data correctly, and to use the minimum number of animals necessary to achieve the scientific objectives—but not so few as to miss biologically important effects or require unnecessary repetition of experiments. Investigators are urged to consult a statistician at the design stage and are reminded that no experiment should ever be started without a clear idea of how the resulting data are to be analyzed. These guidelines are provided to help biomedical research workers perform their experiments efficiently and analyze their results so that they can extract all useful information from the resulting data. Among the topics discussed are the varying purposes of experiments (e.g., exploratory vs. confirmatory); the experimental unit; the necessity of recording full experimental details (e.g., species, sex, age, microbiological status, strain and source of animals, and husbandry conditions); assigning experimental units to treatments using randomization; other aspects of the experiment (e.g., timing of measurements); using formal experimental designs (e.g., completely randomized and randomized block); estimating the size of the experiment using power and sample size calculations; screening raw data for obvious errors; using the t-test or analysis of variance for parametric analysis; and the effective graphical presentation of data.
Keywords: animal experiments, experimental design, statistics, variation
Experiments using laboratory animals should be well designed, efficiently executed, correctly analyzed, clearly presented, and correctly interpreted if they are to be ethically acceptable. Unfortunately, surveys of published papers reveal that many fall short of this ideal, and in some cases, the conclusions are not even supported by the data (Festing 1994; Festing and Lovell 1995, 1996; McCance 1995). This situation is unethical and results in a waste of scientific resources. In contrast, high-quality methods will help to ensure that the results are scientifically reliable and will not mislead other researchers.
The aim of these guidelines is to help investigators who use animals ensure that their research is performed efficiently and humanely, with the minimum number of animals to achieve the scientific objectives of the study. Some knowledge of statistics is assumed because most scientists will have had some training in this discipline. However, scientists using animals should always have access to a statistician who can help with unfamiliar or advanced methods.
These guidelines and suggestions for further reading are based partly on previously published guidelines for contributors to medical journals (Altman et al. 2000) and for in vitro experiments (Festing 2001). Although a useful set of guidelines for “appropriate statistical practice” in toxicology experiments has previously been published (Muller et al. 1984), with a more extensive set of suggestions for the design and analysis of carcinogenicity studies (Fairweather et al. 1998), general guidelines aimed specifically at experiments using laboratory animals in both academic and applied research do not appear to have been published recently. However, a recent book covers in more detail much of the ground discussed here (Festing et al. 2002).
Although responsibility for the quality of research rests clearly with those who perform it, we believe journal editors should ensure adequate peer review by individuals knowledgeable in experimental design and statistics. They should also ensure that there is a sufficiently full description of animals, experimental designs, and statistical methods used and should encourage the discussion of published papers through letters to the editor and, when possible, by suggesting that authors publish their raw data electronically (Altman 2002).
The use of animals in scientific experiments likely to cause pain, distress, or lasting harm generates important ethical issues. Animals should be used only if the scientific objectives are valid, there is no other alternative, and the cost to the animals is not excessive. “Validity” in this case implies that the experiment has a high probability of meeting the stated objectives and that these objectives have a reasonable chance of contributing to human or animal welfare, possibly in the long term.
The following “3Rs” of Russell and Burch (1959) provide a framework for considering the humane use of animals:
Animals should be replaced by less sentient alternatives such as invertebrates or in vitro methods whenever possible.
Experimental protocols should be refined to minimize any adverse effects for each individual animal. For example, appropriate anesthesia and analgesia should be used for any surgical intervention. Death is not an acceptable endpoint if it is preceded by some hours of acute distress, and humane endpoints should be used whenever possible (Stokes 2000). Staff should be well trained, and housing should be of a high standard with appropriate environmental enrichment. Animals should be protected from pathogens.
The number of animals should be reduced to the minimum consistent with achieving the scientific objectives of the study, recognizing that important biological effects may be missed if too few animals are used. Some thought should also be given to the required precision of any outcomes to be measured. For example, chemicals are classified into a number of groups on the basis of their acute toxicity in animals. It may not be necessary to obtain a highly precise estimate of the median lethal dose (LD50 value) to classify them. A number of sequential experimental designs that use fewer animals have been developed for this purpose (Lipnick et al. 1995; Rispin et al. 2002; Schlede et al. 1992). Ethical review panels should also insist that any scientist who does not have a good background in experimental design and statistics should consult a statistician.
All research should be described in such a way that it could be repeated elsewhere. Authors should clearly state the following:
- The objectives of the research and/or the hypotheses to be tested;
- The reason for choosing their particular animal model;
- The species, strain, source, and type of animal used;
- The details of each separate experiment being reported, including the study design and the number of animals used; and
- The statistical methods used for analysis.
Experiments and Surveys
An experiment is a procedure for collecting scientific data on the response to an intervention in a systematic way to maximize the chance of answering a question correctly (confirmatory research) or to provide material for the generation of new hypotheses (exploratory research). It involves some treatment or other manipulation that is under the control of the experimenter, and the aim is to discover whether the treatment is causing a response in the experimental subjects and/or to quantify such response. A survey, in contrast, is an observational study used to find associations between variables that the scientist cannot usually control. Any association may or may not be due to a causal relation. These guidelines are concerned only with experiments.
Experiments should be planned before they are started, and this planning should include the statistical methods used to assess the results. Sometimes a single experiment is replicated in different laboratories or at different times. However, if this replication is planned in advance and the data are analyzed accordingly, it still represents a single experiment.
Confirmatory and Exploratory Experiments
Confirmatory research normally involves formal testing of one or more prespecified hypotheses. By contrast, exploratory research normally involves looking for patterns in the data with less emphasis on formal testing of hypotheses. Commonly, exploratory experiments involve many characters. For example, many microarray experiments in which up or down regulation of many thousands of genes is assayed in each animal could be classified as exploratory experiments because the main purpose is usually to look for patterns of response rather than to test some prespecified hypotheses. There is frequently some overlap between these two types of experiment. For example, an experiment may be set up to test whether a compound produces a specific effect on the body weight of rats—a confirmatory study. However, data may also be collected on hematology and clinical biochemistry, and exploratory investigations using these data may suggest additional hypotheses to be tested in future confirmatory experiments.
Investigations Involving Several Experiments
Scientific articles often report the results of several independent experiments. When two or more experiments are presented, they should be clearly distinguished and each should be described fully. It is helpful to readers to number the experiments.
Animals as Models of Humans or Other Species
Laboratory animals are nearly always used as models or surrogates of humans or other species. A model is a representation of the thing being modeled (the target). It must have certain characteristics that resemble the target, but it can be very different in other ways, some of which are of little importance whereas others may be of great practical importance. For example, the rabbit was used for many years as a model of diabetic humans for assaying the potency of insulin preparations because it was well established that insulin reduces blood glucose levels in rabbits as well as in humans. The fact that rabbits differ from humans in many thousands of ways was irrelevant for this particular application. This was a well-validated model, but it has now been replaced with chemical methods.
Other models may be less well validated; and in some cases it may be difficult, impossible, or impractical to validate a given model. For example, it is widely assumed that many industrial chemicals that are toxic at a given dose in laboratory animals will also be toxic to humans at approximately the same dose after correcting for scale. However, it is usually not possible to test this assumption. Clearly, the validity of an animal model as a predictor of human response depends on how closely the model resembles humans for the specific characters being investigated. Thus, the validity of any model, including mathematical, in vitro, and lower organism models, must be considered on a case-by-case basis.
Need to Control Variation
After a model has been chosen, the aim of the experiment is to determine how it responds to the experimental treatment(s). A good model is sensitive to the experimental treatments, responding well with minimal variation among subjects treated alike. Uncontrolled variation, whether caused by infection, genetics, or environmental or age heterogeneity, reduces the power of an experiment to detect treatment effects.
If mice or rats are being used, the use of isogenic strains should be considered because they are usually more uniform phenotypically than commonly used outbred stocks. Experiments using such animals either should be more powerful and able to detect smaller treatment responses or could use fewer animals. When it is necessary to replicate an experiment across a range of possible susceptibility phenotypes, small numbers of animals of several different inbred strains can be used in a factorial experimental design (see below) without any substantial increase in total numbers (Festing 1995, 1997, 1999). The advantage of this design is that the importance of genetic variation in response can be quantified. Inbred strains have many other useful properties. Because all individuals within a strain are genetically identical (apart possibly from a small number of recent mutations), it is possible to build up a genetic profile of the genes and alleles present in each strain. Such information can be of value in planning and interpreting experiments. Such strains remain genetically constant for many generations, and identification of individual strains is possible using genetic markers. There is a considerable literature on the characteristics of the more common strains, so that strains suitable for each project can be chosen according to their known characteristics (Festing 1997, 1999; www.informatics.jax.org).
Animals should be maintained in good environmental conditions because animals under stress are likely to be more variable than those maintained in optimum conditions (Russell and Burch 1959). When a response is found in the animal, its true relevance to humans is not known. Thus, clinical trials are still needed to discover the effects of any proposed treatment in humans. However, in testing toxic environmental chemicals, it is normally assumed that humans respond in a similar way to animals, although this assumption can rarely be tested. The animals should be adequately described in the materials and methods or other relevant section of the paper or report. The Appendix provides a checklist of the sort of information that might be provided, depending on the individual study.
The experimental design depends on the objectives of the study. It should be planned in detail, including the development of written protocols and consideration of the statistical methods to be used, before starting work.
In principle, a well-designed experiment avoids bias and is sufficiently powerful to be able to detect effects likely to be of biological importance. It should not be so complicated that mistakes are made in its execution. Virtually all animal experiments should be done using one of the formal designs described briefly below.
Each experiment involves a number of experimental units, which can be assigned at random (see below) to a treatment. The experimental unit should also be the unit of statistical analysis. It must be possible, in principle, to assign any two experimental units to different treatments. For this reason, if the treatment is given in the diet and all animals in the same cage therefore have the same diet, the cage of animals (not the individual animals within the cage) is the experimental unit. This situation can cause some problems. In studying the effects of an infection, for example, it may be necessary to house infected animals in one isolator and control animals in another. Strictly, the isolator is then the experimental unit because it was the entity assigned to the treatment and an analysis based on a comparison of individual infected versus noninfected animals would be valid only with the additional assumption (which should be explicitly stated) that animals within a single isolator are no more or no less alike than animals in different isolators. Although individual animals are often the experimental units assigned to the treatments, a crossover experimental design may involve assigning an animal to treatments X, Y, and Z sequentially in random order, in which case the experimental unit is the animal for a period of time. Similarly, if cells from an animal are cultured in a number of dishes that can be assigned to different in vitro treatments, then the dish of cells is the experimental unit.
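Where the cage, rather than the individual animal, is the experimental unit, one simple way to keep the analysis consistent with the randomization is to reduce each cage to a single summary value before any statistical test. A minimal sketch in Python, using hypothetical cage labels and body weights:

```python
from statistics import mean

# Hypothetical data: body weights (g) of individual mice, keyed by cage.
# Because the diet was given per cage, the cage (not the mouse) is the
# experimental unit, so the analysis should use one value per cage.
weights_by_cage = {
    "cage1": [24.1, 25.3, 23.8],   # control diet
    "cage2": [23.9, 24.7, 25.0],   # control diet
    "cage3": [27.2, 26.8, 28.1],   # treated diet
    "cage4": [26.5, 27.9, 27.0],   # treated diet
}

# One observation per experimental unit for the statistical analysis
cage_means = {cage: mean(w) for cage, w in weights_by_cage.items()}
```

The cage means (not the individual mouse weights) would then be carried forward into a t-test or ANOVA with the cage as the unit of analysis.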
Split-plot experimental designs have more than one type of experimental unit. For example, cages each containing two mice could be assigned at random to a number of dietary treatments (so the cage is the experimental unit for comparing diets), and the mice within the cage may be given one of two vitamin treatments by injection (so the mice are experimental units for the vitamin effect). In each case, the analysis should reflect the way the randomization was done.
Treatments should be assigned so that each experimental unit has a known, often equal, probability of receiving a given treatment. This process, termed randomization, is essential because there are often sources of variation, known or unknown, which could bias the results. Most statistical packages for computers will produce random numbers within a specified range, which can be used in assigning experimental units to treatments. Some textbooks have tables of random numbers designed for this purpose. Alternatively, treatment assignments can be written on pieces of paper and drawn out of a bag or bowl for each experimental unit (e.g., animal or cage). If possible, the randomization method should ensure that there are predefined numbers in each treatment group.
Note that the different treatment groups should be processed identically throughout the whole experiment. For example, measurements should be made at the same times. Furthermore, animals of different treatment groups should not be housed on different shelves or in different rooms because the environments may be different (see Blinding and Block Designs below).
To avoid bias, experiments should be performed “blind” with respect to the treatments when possible and particularly when there is any subjective element in assessing the results. After the randomized allocation of animals (or other experimental unit) to the treatments, animals, samples, and treatments should be coded until the data are analyzed. For example, when an ingredient is administered in the diet, the different diets can be coded with numbers and/or colors and the cages can be similarly coded to ensure that the correct diet is given to each cage. Animals can be numbered in random order so that at the postmortem examination there will be no indication of the treatment group. Pathologists who read slides from toxicity experiments are often not blinded with respect to treatment group, which can cause problems in the interpretation of the results (Fairweather et al. 1998).
Pilot studies, sometimes involving only a single animal, can be used to test the logistics of a proposed experiment. Slightly larger ones can provide estimates of the means and standard deviations and possibly also some indication of likely response, which can be used in a power analysis to determine sample sizes of future experiments (see below). However, if the pilot experiment is very small, these estimates will be inaccurate.
Formal Experimental Designs
Several formal experimental designs are described in the literature, and most experiments should use one of these designs. The most common are completely randomized, randomized block (see below), and factorial designs; however, Latin square, crossover, repeated measures, split-plot, incomplete block, and sequential designs are also used. These formal designs have been developed to take account of special features and constraints of the experimental material and the nature of the investigation. It is not possible to describe all of the available experimental designs here. They are described in many statistical textbooks.
Investigators are encouraged to name and describe fully the design they used to enable readers to understand exactly what was done. We also recommend including an explanation of a nonstandard design, if used.
Within each type of design there is considerable flexibility in terms of choice of treatments and experimental conditions; however, standardized methods of statistical analysis are usually available. In particular, when experiments produce numerical data, they can often be analyzed using some form of the analysis of variance (ANOVA).
Completely randomized designs, in which animals (or other experimental units) are assigned to treatments at random, are widely used for animal experiments. The main advantages are simplicity and tolerance of unequal numbers in each group, although balanced numbers are less important now that good statistical software is available for analyzing more complex designs with unequal numbers in each group. However, simple randomization cannot take account of heterogeneity of experimental material or variation (e.g., due to biological rhythms or environment), which cannot be controlled over a period of time.
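For a completely randomized design with a quantitative outcome, the one-way ANOVA F statistic is built from the between-group and within-group sums of squares. A minimal sketch with hypothetical data (in practice a statistical package would also supply the p-value from the F distribution):

```python
from statistics import mean

def one_way_anova(groups):
    """F statistic for a completely randomized (one-way) design."""
    k = len(groups)                              # number of treatment groups
    n = sum(len(g) for g in groups)              # total number of units
    grand_mean = mean(x for g in groups for x in g)
    group_means = [mean(g) for g in groups]
    # Variation between treatment groups vs. variation within them
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical data: three treatment groups of three animals each
f, df1, df2 = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

The F statistic is simply the ratio of the between-group mean square to the within-group mean square; large values indicate treatment differences beyond what within-group variation would explain.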
Randomized complete block designs are used to split an experiment into a number of “mini-experiments” to increase precision and/or take account of some natural structure of the experimental material. With large experiments, it may not be possible to process all of the animals at the same time or house them in the same environment, so it may be better to divide the experiment into smaller blocks that can be handled separately. Typically, a “block” will consist of one or more animals (or other experimental units) that have been assigned at random to each of the different treatment groups. Thus, if there are six different treatments, a block will consist of a multiple of six animals that have been assigned at random to each of the treatments. Blocking thus ensures balance of treatments across the variability represented by the blocks. It may sometimes be desirable to perform within-litter experiments when, for example, comparing transgenic animals with wild-type ones, with each litter being a block. Similarly, when the experimental animals differ excessively in age or weight, it may be best to choose several groups of uniform animals and then assign them to the treatments within the groups. Randomized block designs are often more powerful than completely randomized designs, but their benefits depend on correct analysis, using (usually) a two-way ANOVA without interaction. Note that when there are only two treatments, the block size is two and the resulting data can be analyzed using either a paired t-test or the two-way ANOVA noted above, which are equivalent.
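For the two-treatment case with blocks of size two, the paired t-test can be sketched as follows (the litter data are hypothetical and `paired_t` is an illustrative helper, not a standard library function; each block contributes one within-block difference):

```python
import math
from statistics import mean, stdev

def paired_t(treated, control):
    """Paired t statistic: each pair (a block of size two) supplies one difference."""
    diffs = [t - c for t, c in zip(treated, control)]
    n = len(diffs)
    # t = mean difference divided by its standard error
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t_stat, n - 1  # statistic and degrees of freedom

# Hypothetical data: one treated and one control mouse per litter (4 litters),
# body weights in grams; litters act as blocks
treated = [26.0, 28.0, 25.0, 29.0]
control = [25.0, 26.0, 24.0, 27.0]
t_stat, df = paired_t(treated, control)
```

Because only the within-litter differences enter the calculation, between-litter variation is removed from the error term, which is exactly the benefit blocking provides.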
Choice of Dependent Variable(s), Characters, Traits, or Outcomes
Confirmatory experiments normally have one or a few outcomes of interest, also known as dependent variables, which are typically mentioned in the experimental hypotheses. For example, the null hypothesis might be that the experimental treatments do not affect body weight in rats. Ideally there should be very few outcomes of primary interest, but some toxicity experiments involve many dependent variables, any of which may be altered by a toxic chemical. Exploratory experiments often involve many outcomes, such as the thousands of dependent variables in microarray experiments. When there is a choice, quantitative (measurement) data are better than qualitative data (e.g., counts) because the required sample sizes are usually smaller. When there are several correlated outcomes (e.g., organ weights), some type of multivariate statistical analysis may be appropriate.
In some studies, scores such as 0, +, ++, and +++ are used. Such “ordinal” data should normally be analyzed by comparing the number in each category among the different treatment groups, preferably taking the ordering into account. Converting scores to numerical values with means and standard deviations is inappropriate.
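One rank-based way to compare two groups of such ordinal scores is the Mann-Whitney U statistic with midranks for ties, sketched below (scores 0, +, ++, +++ coded as 0 to 3; the data and the `mann_whitney_u` helper are hypothetical illustrations, and a statistical package would also supply a p-value):

```python
def mann_whitney_u(group_a, group_b):
    """U statistic for group_a, using midranks for tied ordinal scores."""
    combined = sorted(group_a + group_b)
    # Assign each distinct score the average of the ranks it occupies
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        ranks[combined[i]] = (i + 1 + j) / 2  # midrank over positions i+1..j
        i = j
    rank_sum_a = sum(ranks[x] for x in group_a)
    n_a = len(group_a)
    return rank_sum_a - n_a * (n_a + 1) / 2

# Hypothetical pathology grades: control vs. treated animals
u = mann_whitney_u([0, 0, 1, 1], [1, 2, 2, 3])
```

Only the ordering of the scores is used, so the arbitrary numerical coding of the categories does not affect the result, which is what makes a rank-based approach appropriate for ordinal data.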
Choice of Independent Variables or Treatments
Experiments usually involve the deliberate alteration of some treatment factor such as the dose level of a drug. The treatments may include one or more “controls.” Negative controls may be untreated animals or those treated with a placebo without an active ingredient. The latter is normally more appropriate, although it may be desirable to study both the effect of the active agent and the vehicle, in which case both types of control will be needed. Surgical studies may involve sham-operated controls, which are treated in the same way as the tested animals but without the final surgical treatment.
Positive controls are sometimes used to ensure that the experimental protocols were actually capable of detecting an effect. Failure of these controls to respond might imply, for example, that some of the apparatus was not working correctly. Because these animals may suffer adverse effects, and they may not be necessary to the hypothesis being tested, small numbers may be adequate.
Dose levels should not be so high that they cause unnecessary suffering or unwanted loss of animals. When different doses are being compared, three to approximately six dose levels are usually adequate. If a dose-response relation is being investigated, the dose levels (X-variable) should cover a wide range to obtain a good estimate of the response, although the response may not be linear over a wide range. Dose levels are frequently chosen on a log2 or log10 scale. If the aim is to test for linearity, then more than two dose levels must be used. If possible, we recommend using dose levels that are equally spaced on some scale, which may facilitate the statistical analysis. More details of choice of dose levels and dilutions in biological assay are given by Finney (1978).
Toxicologists often use fractions (e.g., half to a quarter or less) of the maximum tolerated dose (the largest dose that results in only minimal toxic effects) in long-term studies. The scientific validity of using such high dose levels has been questioned because the response to high levels of a toxic chemical may be qualitatively different from the response to low levels (Fairweather et al. 1998). The possibility of exploring the effects of more than one factor (e.g., treatment, time, sex, or strain) using factorial designs (see below) should be considered.
Uncontrolled (Random) Variables
In addition to the treatment variables, there may be a number of random variables that are uncontrollable yet may need to be taken into account in designing an experiment and analyzing the results. For example, circadian rhythms may cause behavior measured in the morning to be different from that measured in the afternoon. Similarly, the experimental material may have some natural structure (e.g., members of a litter of mice may be more similar than animals of different litters). Measurements made by different people or at different times may be slightly different, and reagents may deteriorate over a period of time. If these effects are likely to be large in relation to the outcomes being investigated, it will be necessary to account for them at the design stage (e.g., using a randomized block, Latin square, or other appropriate design) or at the time of the statistical analysis (e.g., using covariance analysis).
Factorial experiments have more than one type of treatment or independent variable (e.g., a drug treatment and the sex of the animals). The aim could be to learn whether there is a response to a drug and whether it is the same in both sexes (i.e., whether the factors interact with or potentiate each other). These designs are often extremely powerful in that they usually provide more information for a given size of experiment than most single factor designs at the cost of increased complexity in the statistical analysis. They are described in most statistical texts (e.g., Cox 1958; Montgomery 1997).
In some situations, a large number of factors that might influence the results of an experiment can be studied efficiently using more advanced factorial designs. For example, in screening potential drugs, it may be desirable to choose a suitable combination of variables (e.g., presence/absence of the test compound; the sex, strain, age, and diet of the animals; time after treatment; and method of measuring the endpoint). If there were only two levels of each of these variables, then there would be 2^7 = 128 treatment combinations to be explored. Special methods are available for designing such experiments without having to use excessively large numbers of animals (Cox 1958; Cox and Reid 2000; Montgomery 1997). This type of design can also be used to optimize experiments that are used repeatedly with only minor changes in the treatments, such as in drug development, when many different compounds are tested using the same animal model (Shaw et al. 2002).
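The size of such a screening problem, and the way a fractional design reduces it, can be illustrated by enumerating all 2^7 runs and keeping the standard half-fraction defined by the relation I = ABCDEFG (factor levels coded -1/+1; this is a sketch of the counting argument, not a substitute for the design texts cited above):

```python
import itertools
import math

# Seven hypothetical two-level factors, levels coded -1 / +1
n_factors = 7
full_design = list(itertools.product([-1, 1], repeat=n_factors))  # all 2^7 runs

# Half-fraction 2^(7-1): keep only runs whose seven levels multiply to +1,
# i.e. the fraction defined by the relation I = ABCDEFG
half_fraction = [run for run in full_design if math.prod(run) == 1]
```

The half-fraction retains only runs with an even number of low levels, halving the number of treatment combinations (and hence animals) while still allowing all main effects to be estimated, at the cost of confounding some high-order interactions.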
Deciding how large an experiment needs to be is of critical importance because of the ethical implications of using animals in research. An experiment that is too small may miss biologically important effects, whereas an experiment that is too large wastes animals. Scientists are often asked to justify the numbers of animals they propose to use as part of the ethical review process.
A power analysis is the most common way of determining sample size. The appropriate sample size depends on a mathematical relation among the following (described in more detail below): (1) the effect size of interest; (2) the standard deviation (for variables with a quantitative effect); (3) the chosen significance level; (4) the chosen power; (5) the alternative hypothesis; and (6) the sample size. The investigator generally specifies the first five of these items, which together determine the sample size. It is also possible to calculate the power or the effect size if the sample size is fixed (e.g., as a result of restricted resources). The formulae are complex; however, several statistical packages offer power analysis for estimating sample sizes when estimating a single mean or proportion, comparing two means or proportions, or comparing means in an analysis of variance. There are also dedicated packages (e.g., nQuery Advisor [Statistical Solutions, Cork, UK; Elashoff 1997]), which have a much wider range of analyses (Thomas 1997). A number of web sites also provide free power analysis calculations for the simpler situations, and the following sites are currently available: http://ebook.stat.ucla.edu/cgi-bin/engine.cgi; http://www.math.yorku.ca/SCS/Demos/power/; and http://hedwig.mgh.harvard.edu/quan_measur/para_quant.html. Sample size is considered in more detail by Dell and colleagues in this volume (2002), and Cohen (1988) provides extensive tables and helpful discussion of methods.
Briefly, when only two groups are to be compared, the effect size is the difference in means (for a quantitative character) or proportions (for a qualitative, dead/alive character) that the investigator wants the experiment to be able to detect. For example, the investigator could specify the minimum difference in mean body weight between a control group of rats and a treated group that would be of biological importance and that he/she considers the experiment should be able to detect. It is often convenient to express the effect size “D” in units of standard deviations by dividing through by the standard deviation (discussed below). D is a unitless number that can be compared across different experiments and/or with different outcomes. For example, if the standard deviation of litter size in a particular colony of BALB/c strain mice is 0.8 pups (with a mean of ~5 pups) and an experiment is to be set up to detect a difference in mean litter size between treated and control groups of, for example, 1.0 pups, then D = 1.0/0.8 = 1.25 standard deviation units. If the standard deviation of the total number of pups weaned per cage in a 6-mo breeding cycle is 10 pups (with a mean of ~55 pups) and the experiment is set up to detect a difference between a control group and a treated group of 5.0 pups, then D = 5/10 = 0.5. This effect size is smaller, so it would require a larger experiment than would detection of the change in litter size. Similarly, if a control group is expected to have, for example, 20% spontaneous tumors, and the compound is a suspected carcinogen, the increase in the percentage of tumors in the treated group that it would be important to be able to detect must be specified.
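Using the normal approximation, the sample size per group implied by these two standardized effect sizes can be sketched as follows (the 5% two-sided significance level and 90% power are assumptions chosen for illustration; an exact t-based calculation would add one or two animals per group):

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.9):
    """Approximate animals per group to detect standardized effect size d
    with a two-sided test (normal approximation to the two-sample t-test)."""
    z = NormalDist().inv_cdf
    return math.ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2 / d ** 2)

# Effect sizes from the text: D = 1.25 (litter size) vs. D = 0.5 (pups weaned)
n_large_effect = n_per_group(1.25)
n_small_effect = n_per_group(0.5)
```

The comparison makes the point in the text concrete: the smaller standardized effect (D = 0.5) needs several times as many animals per group as the larger one (D = 1.25).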
The standard deviation among experimental units appropriate to the planned experimental design must be specified (for quantitative characters). For a randomized block or crossover design the appropriate estimate will usually be the square root of the error mean square from an analysis of variance conducted on a previous experiment. When no previous study has been done, a pilot study may be used, although the estimate will not be reliable if the pilot study is very small.
The significance level is the chance of obtaining a false-positive result due to sampling error (known as a Type I error). It is usually set at 5%, although lower levels are sometimes specified.
The practical steps in planning and conducting an experiment, which broadly mirror the scientific method, are as follows:
- Recognition and statement of the problem
- Choice of factors, levels, and ranges
- Selection of the response variable(s)
- Choice of design
- Conducting the experiment
- Statistical analysis
- Drawing conclusions, and making recommendations
This course deals primarily with the choice of design, including all the related issues of how the factors are handled in conducting the experiments.
We usually speak of "treatment" factors, which are the factors of primary interest. In addition to treatment factors, there are nuisance factors, which are not the primary focus but must still be dealt with. These are sometimes called blocking factors because we block on them to keep them from obscuring or biasing the treatment comparisons.
There are other ways that we can categorize factors:
Experimental vs. Classification Factors
Experimental Factors - factors whose levels you can specify and then assign at random as the treatment to the experimental units. Examples are temperature, the level of an additive, or the amount of fertilizer per acre.
Classification Factors - factors that cannot be changed or randomly assigned; they come as labels on the experimental units. The age and sex of the participants are classification factors. Although they cannot be assigned, you can still select individuals at random from within these groups.
Quantitative vs. Qualitative Factors
Quantitative Factors - factors for which any specified level can be set, such as the percent concentration or pH of a chemical.
Qualitative Factors - factors whose levels are distinct categories, such as the species of a plant or animal, a brand in the marketing field, or gender; the levels are neither ordered nor continuous.
Think About It:
Think about your own field of study and jot down several factors that are pertinent in your research area. Into which categories do they fall?
Get statistical thinking involved early when you are preparing to design an experiment; getting well into an experiment before you have considered these implications can be disastrous. Think and experiment sequentially: experimentation is a process in which what you already know informs the design of the current experiment, and what you learn from it becomes the knowledge base for designing the next.