Prepare a tab-delimited file in this format:
| ORF | Exp 1 | Exp 2 | Exp 3 | Exp 4 | ... |
| YKR005C | 0.1 | -0.1 | 0.4 | ... | ... |
| YKR006C | 1.45 | NaN | -1.5 | ... | ... |
| YKR007W | NaN | 0.28 | -2.7 | ... | ... |
| YKR008W | 0.52 | 0.26 | -1.4 | ... | ... |
| YKL225W | 0.9 | -1.96 | 0.35 | ... | ... |
| YKR009C | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
Leave missing entries blank or fill them with NaN. The data should already be pre-processed and normalized. Please fill in your name, email address, and affiliation so that we can contact you when this tool is improved.
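As a minimal sketch of producing a file in this layout, the Python snippet below writes a few rows of the example table as tab-delimited text, with missing values written as NaN. The function name `write_expression_table` and the in-memory buffer are illustrative assumptions and not part of this tool; in practice you would write to a file opened in text mode.

```python
import csv
import io
import math

def write_expression_table(rows, header, out):
    """Write (ORF, values) rows as tab-delimited text; missing values become NaN."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for orf, values in rows:
        # None and float("nan") are both treated as missing entries
        cells = ["NaN" if v is None or math.isnan(v) else repr(v) for v in values]
        writer.writerow([orf] + cells)

# A few rows taken from the example table above (hypothetical subset of columns)
header = ["ORF", "Exp 1", "Exp 2", "Exp 3"]
rows = [
    ("YKR005C", [0.1, -0.1, 0.4]),
    ("YKR006C", [1.45, None, -1.5]),
    ("YKR007W", [None, 0.28, -2.7]),
]

buffer = io.StringIO()  # stand-in for open("expression.txt", "w", newline="")
write_expression_table(rows, header, buffer)
table_text = buffer.getvalue()
print(table_text, end="")
```

Writing NaN explicitly rather than leaving cells blank makes missing entries easy to see when the file is inspected by eye; either form is accepted according to the note above.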
Imputation takes some time. The computation time grows linearly with the number of columns and with the number of missing entries. For yeast, with 6,200+ ORFs, one column with 1,200 missing entries takes less than 30 seconds when the load on the server is light.
This tool is free of charge for academic research. For commercial use, please contact Ming Ouyang, ouyangmi AT umdnj DOT edu.
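Because the running time depends on the number of columns and the number of missing entries, it may help to count the missing entries per column before submitting a file. The sketch below is one way to do that; the function name `count_missing` is a hypothetical helper, and it assumes that a blank cell or the text NaN (in any letter case) marks a missing entry.

```python
import csv
import io

def count_missing(table_text):
    """Return {column name: number of blank or NaN cells} for a tab-delimited table."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader)
    counts = {name: 0 for name in header[1:]}  # skip the ORF column
    for row in reader:
        if not row:
            continue  # ignore completely empty lines
        # Pad short rows so trailing blank cells are counted as missing
        cells = row[1:] + [""] * (len(header) - len(row))
        for name, cell in zip(header[1:], cells):
            value = cell.strip()
            if value == "" or value.lower() == "nan":
                counts[name] += 1
    return counts

# Small example with one NaN entry and one blank entry
sample = (
    "ORF\tExp 1\tExp 2\tExp 3\n"
    "YKR005C\t0.1\t-0.1\t0.4\n"
    "YKR006C\t1.45\tNaN\t-1.5\n"
    "YKR007W\t\t0.28\t-2.7\n"
)
missing = count_missing(sample)
print(missing)
```

The counts only indicate the size of the job; they do not predict the actual running time, which also depends on the server load.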
If this tool helps with your work, please cite:
Jörnsten R, Ouyang M, Wang HY. A meta-data based method for DNA microarray imputation. BMC Bioinformatics, in press.
Abstract:
DNA microarray
experiments are conducted in logical sets, such as time course
profiling after a treatment is applied to the samples, or
comparisons of the samples under two or more conditions. Due to
cost and design constraints of spotted cDNA microarray experiments,
each logical set commonly includes only a small number of replicates
per condition. Despite the vast improvement of the microarray
technology in recent years, missing values are prevalent.
Intuitively, imputation of missing values is best done using many
replicates within the same logical set. In practice, there are few
replicates and thus reliable imputation within logical sets is
difficult. However, it is in the case of few replicates that the
presence of missing values, and how they are imputed, can have the
most profound impact on the outcome of downstream analyses
(e.g. significance analysis and clustering). This study explores
the feasibility of imputation across logical sets, using the vast
amount of publicly available microarray data to improve imputation
reliability in the small sample size setting.
We download all cDNA microarray data of Saccharomyces cerevisiae,
Arabidopsis thaliana, and Caenorhabditis elegans from the Stanford
Microarray Database. Through cross-validation and simulation, we
find that, for all three species, our proposed meta-data based
imputation across logical sets is far superior to imputation within
a set, and sometimes to an astonishing degree. Furthermore, the
imputation root mean square error for significant genes is generally
a lot less than that of non-significant ones. Since downstream
analysis of significant genes, such as clustering and network
analysis, can be very sensitive to small perturbations of estimated
gene effects, it is highly recommended that researchers apply
reliable data imputation prior to further analysis. Our method is
applicable to cDNA microarray experiments from other species,
provided a good collection of meta-data is available in the public
domain.