The curious case of area measurement in surveys (I)
Innovative data collection methods for agricultural and rural statistics can help small farmers in Southeast Asia.
Co-written with Pamela Lapitan and Rea Jean Tabaco
Economists often take datasets “as given,” without considering data quality and its implications for analysis and policy recommendations. There are two major reasons for this:
- Some do not understand the painstaking process involved in collecting good-quality primary data, despite (seemingly) knowing the econometric implications of “poor data.” Classical measurement error in a continuous dependent variable does not bias the parameter estimates in a regression framework, but measurement error in the independent variables can introduce significant bias.
- Others may understand the data challenges but do not consider the marginal investment in collecting better data worth their blood and sweat. After all, collecting primary data requires a combination of skills: writing funding proposals (and securing the funding), designing robust samples, preparing questionnaires and other survey tools, hiring and training enumerators, and overseeing data collection and cleaning. Analysis is important, but it is a last-mile problem.
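To make the measurement-error point concrete, here is a minimal simulation sketch. It is not drawn from the project data: the sample size, true slope, and noise levels are all hypothetical values chosen for illustration. It adds classical (independent, zero-mean) noise first to the dependent variable and then to the independent variable, and compares the fitted slopes.

```python
import random


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x


def simulate(n=20000, true_slope=2.0, noise_sd=1.0, seed=0):
    """Compare slopes when noise is added to y versus to x (hypothetical setup)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [true_slope * xi + rng.gauss(0, 0.5) for xi in x]
    # Classical measurement error: independent of the true values, mean zero.
    y_noisy = [yi + rng.gauss(0, noise_sd) for yi in y]
    x_noisy = [xi + rng.gauss(0, noise_sd) for xi in x]
    return {
        "error_in_y": ols_slope(x, y_noisy),  # stays close to the true slope
        "error_in_x": ols_slope(x_noisy, y),  # pulled toward zero (attenuation)
    }


slopes = simulate()
print(slopes)
```

With the noise variance equal to the variance of x, as assumed here, the textbook attenuation factor var(x) / (var(x) + var(noise)) is one half, so the slope estimated from the noisy independent variable should land near half the true value, while noise in the dependent variable only makes the estimate less precise.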
But why should we (especially economists) care about data quality, and how can technology help improve it? To answer this question, we draw on the preliminary results of an ADB technical assistance project on innovative data collection methods for agricultural and rural statistics in Southeast Asia. The project supports four pilot countries (Lao People’s Democratic Republic, the Philippines, Thailand, and Viet Nam) in their transition to using satellite data for computing rice paddy area and production estimates. We focused on data from the province of Thai Binh in Viet Nam, although we plan to extend this analysis to all our pilot countries.
We looked at a critical variable in agricultural statistics and analysis, plot size, and obtained it using three methods: (i) asking farmers directly; (ii) mapping the plot with a GPS device; and (iii) printing a high-resolution Google Earth satellite image of the study area and asking farmers to identify their plot boundaries, which were subsequently digitized using GIS software. To avoid biasing the farmers’ estimates, farmers identified their plot boundaries on the printed image and gave their own plot size estimates before the GPS mapping of their plots was carried out.
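For readers curious how a plot boundary turns into an area figure, the sketch below applies the shoelace formula to a list of boundary points. This is only an illustration, not the procedure used by the GPS devices or GIS software in the project: it assumes the points have already been projected to planar coordinates in meters, and the function name and sample plot are hypothetical.

```python
def polygon_area_m2(vertices):
    """Area of a simple polygon via the shoelace formula.

    vertices: list of (x, y) points in meters, listed in order around
    the boundary (clockwise or counterclockwise).
    """
    n = len(vertices)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


# Hypothetical rectangular plot, 30 m by 12 m.
plot = [(0.0, 0.0), (30.0, 0.0), (30.0, 12.0), (0.0, 12.0)]
area = polygon_area_m2(plot)
print(area)  # 360.0 square meters
```

Raw latitude and longitude readings would need to be projected first, which GIS software normally handles.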
Then we compared these three independent estimates for the randomly selected plots in the province, under the assumption that the GPS-based estimates are the gold standard, as proposed in a recent working paper by the World Bank.
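One simple way to compare an estimate against a benchmark such as GPS is to look at percentage deviations plot by plot. The sketch below does this for made-up numbers (not the Thai Binh data); the function names are hypothetical. It reports both the average signed deviation and the average absolute deviation, since over- and under-estimates can cancel out in the former.

```python
def percent_deviation(estimate, benchmark):
    """Signed deviation of an estimate from the benchmark, in percent."""
    return 100.0 * (estimate - benchmark) / benchmark


def summarize(estimates, benchmarks):
    """Mean signed and mean absolute percentage deviation across plots."""
    devs = [percent_deviation(e, b) for e, b in zip(estimates, benchmarks)]
    n = len(devs)
    return {
        "mean_bias_pct": sum(devs) / n,
        "mean_abs_dev_pct": sum(abs(d) for d in devs) / n,
    }


# Hypothetical plot sizes in square meters.
gps = [400.0, 500.0, 800.0]
farmer = [360.0, 550.0, 800.0]
summary = summarize(farmer, gps)
print(summary)
```

In this toy example the farmer estimates are off by 10% in opposite directions on two plots, so the mean bias is zero even though the typical plot-level error is not.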
Distribution of rice plot sizes from different measurements for Thai Binh, Viet Nam
Source: ADB staff estimates based on field validation and crop cutting activities under R-CDTA 8369.
A quick comparison of these distributions suggests that, while some minor differences exist, the three datasets appear to be drawn from a similar distribution. To explore the differences further, we generated scatter plots of the three estimates, which show a positive relationship between each pair of estimates (R² > 0.7). The GPS- and Google Earth-based estimates fit each other more closely (R² = 0.955) than either fits the farmer recall estimates (R² = 0.734 and 0.722, respectively).
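For a simple linear fit between two variables, R² equals the squared Pearson correlation. Assuming that is how the figures above were obtained (the source does not say), a pairwise R² can be sketched as follows; the data are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy ** 2) / (sxx * syy)


# Hypothetical example: the second series is exactly double the first.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]
print(r_squared(a, b))  # 1.0, although b is twice a on every plot
```

The toy example is a reminder that R² measures how tightly two series move together, not whether they agree in level: two estimates can be perfectly correlated and still differ systematically.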
Scatter plot of farmer-, GPS- and Google Earth-based estimates
Source: ADB staff estimates based on field validation and crop cutting activities under R-CDTA 8369.
Some would be tempted to conclude that the three estimates are fairly similar and that farmer-based estimates seem to be a reasonable way to go given that they are obtained at lower cost, are easy to collect, and provide similar results. Others would argue that even if GPS and Google Earth estimates take a lot of time to implement, they are the more reliable options.
Which measure do you think is the best choice? Let us know by leaving a comment below.
Pamela Lapitan is an Economics and Statistics Analyst and Rea Jean Tabaco is a Consultant in ADB’s Economics Research and Regional Cooperation Department.