Introduction

Measuring the impact of scientific articles is of interest to authors and readers, as well as to tenure and promotion committees, grant proposal review committees, and officials involved in the funding of science. The number of citations by other articles is at present the gold standard for evaluation of the impact of an individual scientific article. Online journals offer another measure of impact: the number of unique downloads of an article (by unique downloads we mean the first download of the PDF of an article by a particular individual). Since May 2007, *Journal of Vision* has published download counts for each individual article. So far as we know, we are the only scientific journal providing these numbers. In the most recent accounting, in July 2008, the top five articles were each downloaded between 1,993 and 3,478 times. While we cannot equate downloading an article with actually reading it, these are nonetheless remarkable numbers. The reader may wonder how total downloads of an article compare with the more traditional measure of citation count. Elsewhere I and others have discussed the differences between download and citation counts, and the advantages and disadvantages of each (Brody, Harnad, & Carr, 2006; “Deciphering citation statistics,” 2008; Perneger, 2004; Watson, 2007). In this note, I discuss the degree of correlation between these two measures.

Before proceeding to the analyses, it is worth contemplating potential outcomes. Since downloads and citations are in some respects complementary measures, we should not expect perfect correlation. But substantial correlation, joined to the fact that downloads generally precede citations, would mean they provide a useful early predictor of eventual citations.

Methods

The data in this report were collected from two sources. The first is our own collection of log files at the *Journal of Vision*. The logs cover the interval from October 23, 2003 to July 1, 2008. The log files were analyzed to extract unique PDF downloads as a function of time since publication. A unique download is the first download of a particular paper by a particular reader. In the remainder of this paper, “downloads” refers to unique PDF downloads.

The second data source is citation counts for all *Journal of Vision* papers collected from Scopus on July 18, 2008. Scopus is a large abstract and citation database of research literature (http://www.scopus.com/). Our Scopus data consist of counts of citations for each article occurring in each calendar year from 2001 to 2008. We excluded data corresponding to editorials and errata.

Results

Total citations vs total downloads

Our first comparison is between the total downloads and total citations. This is shown in Figure 1, in which we plot the two quantities against one another (we add 1 to citations to allow it to be plotted on a log scale). The correlation between these two quantities is 0.74, indicating a strong positive relationship.
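As a rough illustration of this comparison, the sketch below computes a Pearson correlation between log-transformed downloads and log-transformed citations plus one. The text does not say whether the reported 0.74 was computed on raw or log-transformed values, so the log transform here is an assumption borrowed from the plotting convention, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def log_correlation(downloads, citations):
    """Correlate log10(downloads) with log10(citations + 1).

    Assumption: every article has at least one download, so the
    logarithm of the download count is defined. Adding 1 to the
    citations mirrors the plotting convention described in the text.
    """
    log_d = [math.log10(d) for d in downloads]
    log_c = [math.log10(c + 1) for c in citations]
    return pearson(log_d, log_c)
```

The input values would be per-article totals, one entry per article, in the same order in both lists.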

Figure 1

The data in Figure 1 correspond to papers that vary in age from 0 to 7 years. Since papers garner both downloads and citations over time, it is possible that much of the association shown in Figure 1 is due to growth with age. To examine this effect, we first looked at the growth of citations and downloads with time following publication.

Citations vs age

The citation data are already binned into calendar years, so our analysis is coarse. In Figure 2, we plot the average cumulative number of citations per article as a function of article age. The number climbs steadily to about 18 citations after 6 years. Estimates are less certain for the oldest articles because fewer papers contribute to the estimate, but there is as yet no evidence of an asymptote.

Figure 2

Downloads vs age

For the download data, we know the date and time of each download, so the analysis can be performed on a finer time scale. We counted unique downloads for each article in bins of one week from date of publication until the date of the last record (July 1, 2008). For each week, we averaged the count over all articles in existence in that week. The result is shown by the black curve in Figure 3. The remarkably smooth curve shows a rapid initial climb followed by a more gradual rise, reaching a value of about 1000 after 7 years. Elsewhere, we have noted a similar shape that characterizes the growth in downloads for individual articles (Watson, 2007).
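The weekly binning and averaging described above might be sketched as follows. This is a minimal illustration, not the analysis code actually used: the input layout (a publication date plus a list of unique-download dates per article) and the function names are assumptions, only complete weeks are counted, and the sketch ignores the complication that logs are missing for the early life of articles published before October 23, 2003. The running total is included on the assumption that the plotted curve is cumulative.

```python
from datetime import date


def mean_weekly_downloads(articles, last_record):
    """Average downloads per week of article age.

    articles: list of (publication_date, [unique_download_dates]).
    Entry w of the result is the mean number of downloads in week w
    after publication, averaged only over articles that had already
    existed for a full week w by last_record.
    """
    per_article = []
    for published, downloads in articles:
        n_weeks = (last_record - published).days // 7  # complete weeks observed
        counts = [0] * n_weeks
        for d in downloads:
            week = (d - published).days // 7
            if 0 <= week < n_weeks:
                counts[week] += 1
        per_article.append(counts)

    max_weeks = max((len(c) for c in per_article), default=0)
    means = []
    for w in range(max_weeks):
        in_existence = [c[w] for c in per_article if len(c) > w]
        means.append(sum(in_existence) / len(in_existence))
    return means


def cumulative(values):
    """Running total, giving average cumulative downloads by age."""
    total = 0.0
    out = []
    for v in values:
        total += v
        out.append(total)
    return out
```

Because older weeks are averaged over fewer articles, the right-hand end of such a curve rests on less data than the left-hand end.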

Figure 3

An obvious difference between downloads and citations is that the former can occur the moment the article is published, while citations inevitably lag by at least the time required to write and publish an article. That difference aside, both quantities rise systematically with article age. In fact, the rate of growth is quite comparable, once the lag is accounted for. To show this, in Figure 3 we also re-plot as red points the data from Figure 2, advanced in time by 2 years, and multiplied by 45. Loosely described, on average, about 45 downloads correspond to one citation about 2 years later.
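The re-plotting step can be expressed as a simple transformation of the citation curve. The sketch below is only illustrative: the function name and the sample values are hypothetical, and the defaults of 2 years and 45 downloads per citation are the approximate figures quoted above.

```python
def advance_and_scale(citation_curve, lag_years=2.0, downloads_per_citation=45.0):
    """Shift a citation-vs-age curve earlier in time and rescale it.

    citation_curve: list of (age_in_years, mean_cumulative_citations).
    Each point is moved lag_years earlier and multiplied by
    downloads_per_citation so it can be overlaid on a download curve.
    """
    return [
        (age - lag_years, citations * downloads_per_citation)
        for age, citations in citation_curve
    ]


# Illustrative values only, not data from the journal.
overlay = advance_and_scale([(3, 2.0), (5, 10.0)])
```

Points whose shifted age falls below zero would simply lie to the left of the download curve's origin and could be dropped before plotting.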

Citations vs downloads for papers published in a given year

To neutralize the growth with age, we can compare the total downloads and citations (as of July 1, 2008) for papers published in a given year. Figure 4 shows the correlation between total downloads and total citations for papers published in each year. The correlation is strongly positive in each year, with a high of around 0.8 in 2003. Because of the lag between downloads and citations noted above, we should not expect correlations to be as high for articles less than three years old. In articles at least three years old, the correlation is always above 0.6 (except for 2001, which is based on only 12 papers). Recall that total downloads are not accurate for papers published before 2004, because logs were available only from October 23, 2003 onward. We shall have to wait several more years to determine whether the correlation continues to climb with age, and at what age and value it might asymptote.

Figure 4

Figure 5

The *Journal of Vision* is, of course, a very young journal. As can be seen in Figure 4, the number of papers published each year has changed markedly over our lifespan. And our lifespan coincides with a period of radical change in the methods and habits of publication and consumption of scientific articles. Consequently, patterns of submission, citation, and download have changed markedly over the eight years of our existence. Changing correlations over time between citations and downloads may reflect these other changes as well.

CiteRate vs DemandFactor

To this point we have compared two statistics for individual papers: total downloads and total citations. As we have noted, both of these grow over time subsequent to publication, which limits the usefulness of the raw statistics in comparing papers of different ages. However, both statistics can be normalized for age. In the case of downloads, we have proposed the DemandFactor, which corresponds to the number of downloads per day over the first 1000 days of an article's lifetime (Watson, 2007). In the case of citations, we can count citations in some interval of time following publication. This is reminiscent of the impact factor, but for individual articles, and generalized with respect to the interval in which citations are counted.
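A minimal sketch of the DemandFactor, following only the one-sentence description above (downloads per day over the first 1000 days), is given below. The full definition is in Watson (2007) and may differ in detail. The function name, the input layout, and the handling of window boundaries are assumptions, and the sketch does not address articles younger than 1000 days.

```python
from datetime import date


def demand_factor(published, download_dates, window_days=1000):
    """Unique downloads per day over the first window_days of an article's life.

    published: publication date of the article.
    download_dates: dates of its unique PDF downloads.
    Downloads on or after day window_days are ignored.
    """
    in_window = sum(
        1 for d in download_dates if 0 <= (d - published).days < window_days
    )
    return in_window / window_days
```

Because the window is fixed, two articles of different ages can be compared on the same footing, provided both are at least 1000 days old.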

Recall that in our dataset the citations are binned into counts per calendar year for each article. Thus the interval in which citations are counted must be an integer number of years. We characterized the interval by a lag (years after the year of publication) and a length (years included in the count) and explored lags of 0 to 3 years, and lengths of 1 to 5 years (where possible). We computed the number of citations within the interval, and divided by the length of the interval. We call this CiteRate, and it has units of citations/year.
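The CiteRate computation just described can be sketched as below. The function name and the dictionary layout are hypothetical, and the `last_year` argument is an assumed way of expressing "where possible": an interval that extends beyond the available data yields no value.

```python
def cite_rate(citations_by_year, pub_year, lag, length, last_year):
    """Citations per year within a counting interval after publication.

    citations_by_year: dict mapping calendar year -> citations received
        by one article in that year.
    lag: years after the year of publication at which counting starts.
    length: number of calendar years included in the count.
    last_year: last calendar year with usable citation data.
    Returns None when the interval runs past last_year.
    """
    start = pub_year + lag
    end = start + length - 1
    if end > last_year:
        return None
    total = sum(citations_by_year.get(year, 0) for year in range(start, end + 1))
    return total / length


# Illustrative counts only: lag 2, length 3 for a 2003 article
# covers the calendar years 2005 through 2007.
example = cite_rate({2004: 1, 2005: 2, 2006: 4, 2007: 6}, 2003, 2, 3, 2007)
```

With a lag of 0 the interval starts in the publication year itself, which matches the definition of lag as years after the year of publication.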

Since citation rates for individual articles have not been widely described or analyzed, we show the distribution of CiteRate at *Journal of Vision* for a lag of 2 and length of 3. The median of this distribution is 3, and the mean is 4.2682. For comparison, the 2006 Impact Factor of the *Journal of Vision* was 3.753, which describes citations in 2006 of articles published in 2004 and 2005.

Figure 6

We plot the correlation between DemandFactor and CiteRate against article age at the end of the counting interval (length + lag). The red curve shows that the correlation grows steadily with article age, reaching a value of 0.62 after five years. This value is essentially the same as the correlation between total citations and total downloads for year 2004 shown in Figure 4. The other curves show that it does not make much difference how long after publication we wait to begin counting.

To summarize, DemandFactor correlates strongly with CiteRate (*r* = 0.62), measured over an interval of five years after publication. It is possible that the correlation continues to climb for even larger values of article age. This is useful, since DemandFactor, unlike total downloads, can be used to compare articles irrespective of age.

Discussion

Our study confirms and extends a number of previous reports relating online usage and citation statistics. The earliest report measured “hits,” during the first week following publication, of the HTML full text articles published in a single volume (1999) of the *British Medical Journal*, and compared those with citations of the same articles as of May 2004 (Perneger, 2004). For this set of 153 papers, a correlation of 0.54 was found between the logarithms of the two counts. An analysis of downloads (from a UK mirror site) and later citations of physics articles deposited in a large preprint archive (arXiv.org) showed an asymptotic correlation of 0.46 (Brody et al., 2006). For *Nature Neuroscience* articles published in 2005, a correlation of 0.72 was found between citations as of March 2008 and PDF downloads in the first 180 days of an article's lifetime (“Deciphering citation statistics,” 2008).

Conclusions

- Overall correlation between total downloads and total citations of *Journal of Vision* articles is 0.74.
- Citations and downloads increase with article age in a characteristic way, but relative to downloads, citations are delayed by about 2 years and reduced by a factor of about 45.
- For papers published in a single year, the correlation is as high as 0.8, and usually above 0.6.
- The correlation between the age-normalized statistics DemandFactor (downloads/day) and CiteRate (citations/year) is about 0.62.
- Download statistics provide a useful indicator, two years in advance, of eventual citations. Downloads are also a useful measure in their own right of the interest and significance of individual articles.

Acknowledgments

This work was supported in part by NASA's Space Human Factors Engineering Project, WBS 466199 and by NASA/FAA Interagency Agreement DTFAWA-08-X-80023.

Commercial relationships: none.

Corresponding author: Andrew B. Watson.

Email: andrew.b.watson@nasa.gov.

Address: MS 262-2 NASA Ames Research Center, Moffett Field, CA 94035, USA.

References

Brody, T., Harnad, S., & Carr, L. (2006). Earlier Web usage statistics as predictors of later citation impact. Journal of the American Society for Information Science and Technology, 57, 1060–1072.

Perneger, T. V. (2004). Relation between online “hit counts” and subsequent citations: Prospective study of research papers in the BMJ. BMJ, 329, 546–547.

Watson, A. B. (2007). Measuring demand for online articles at the Journal of Vision. Journal of Vision, 7, (7):,