By Steven

2014-11-06 01:05:07 8 Comments

I'm relatively new to GIS and QGIS but I have a problem that I'm not certain how to solve. I searched here for relevant information but found none. (Or, I used incorrect search terms.) My problem is thus:

I have a raster file of crop data determined from satellite imagery. Pixel size is 30 meter x 30 meter.

I have a shapefile with over 270,000 polygons of varying sizes.

I can get general 'zonal statistics' through QGIS but these are relatively meaningless: count, mean, and sum don't help me in determining the predominant--by area--crop type within a given polygon.

Ideally, I'd like to be able to determine which crop type covers the highest area in every polygon. As an example, in the image below, the parcel in the center of the image has at least six crop types within it's boundaries. However, clearly there is one predominant crop type. How can I associate that single crop type with that parcel? Is this possible?

An example image

I'm using QGIS version 2.4.0 on Windows 7.

If there is an R solution, I would enjoy hearing that, too.


@Steven 2014-12-11 16:18:13

I ended up using an R solution almost identical to that provided by @JeffreyEvans.

First, I was able to pare down the number of polygons of my shapefile I was genuinely interested in. I pared these data down by eliminating those polygons that were outside of city limits in the county in which I was interested. This was a simple process done in QGIS to intersect my parcel layer with a city limits layer.

I used the extract() function from the Raster package with the below function for mathematical Mode to extract the singular value for each polygon (of only ~22,000 rather than ~290,000):

  Mode <- function(x, na.rm = FALSE) {
      x = subset(x, !

    ux <- unique(x)
    return(ux[which.max(tabulate(match(x, ux)))])

(I wrote this Mode() function to avoid the density() function returning values that were not integers.)

Then it was just an issue of applying the extract() function with my raster and SpatialPolygonsDataFrame, and Mode() function as values:

    ep <-, myShapefile, fun = Mode, na.rm = T))
    names(ep) <- c("cropValue")

then inserting the crop data back into my SpatialPolygonsDataFrame:

    [email protected]$cropValue <- ep[,1]

I timed how long the data processing took. Initially, I ran the extract function on only the first 10 and 100 rows of my data. For the original SpatialPolygonsDataFrame, I calculated that it would have taken > 100 hours to do these analyses. I contemplated running it on a super-fast virtual machine via AWS EC2 but was able to pare down the spatial data frame to < 10% of its initial size. Still, running the above function on only ~22,000 polygons still took > 6 hours on my moderately fast desktop.

Thanks for the tips and advice! I was unable to discover a QGIS solution but was intrigued enough to start learning Python to remedy that in the future. However, R did the job.

Thanks again.

@Jeffrey Evans 2014-11-07 21:13:23

Here is an R solution that uses the mode of the observed distribution.

Create some example data

  r <- raster(ncol=36, nrow=18)
    r[] <- sample(1:3, ncell(r), replace=TRUE)
      polys <- SpatialPolygons(list(Polygons(list(Polygon(rbind(c(-180,-20),c(-160,5),
                 c(-60, 0),c(-160,-60),c(-180,-20)))), 1), 
                 c(120,-55),c(80,0)))), 2)))
    polys <- SpatialPolygonsDataFrame(polys, data.frame(row.names=c("1","2"),
  plot(polys, col="red", add=TRUE)

Function to calculate peak mode of distribution. This could alternatively, be done by fitting a spline to x. The density approach is just a nice shortcut.

dmode <- function(x) {
  den <- density(x, kernel=c("gaussian"))
    ( den$x[den$y==max(den$y)] )   

Here we extract values for all polygons and calculate mode value using the dmode function. Note; the base R function "mode" returns the storage type of the object class (e.g., numeric, character) and not the distributional mode. Since the mode is a function of the peak distribution we need to round it back to a whole number so it corresponds to a "real" value in the observed distribution. If the raster data were floating point or you wanted a true hinge point, then this would not be necessary.

( ep <- extract(r, polys) )
  ep <- lapply(sp, function(x) x[!]) # remove NA values
    ( x <- round(unlist(lapply(ep, FUN=dmode)),digits=0) )

Since you have a large number of polygons you may want to implement a memory safe for loop that processes one polygon at a time. However, this may be slow.

x <- vector()
  for(i in 1:nrow(polys)){
    p <- polys[i,]
      v <- extract(r, p)
        v <- lapply(v, function(x) x[!]) 
          x <- append(x, round(unlist(lapply(v, FUN=dmode)),digits=0) )
      cat("Mode(s):", x,  "\n")

You can then join the resulting "mode" values back to each polygon and plot results.

[email protected]$MajClass <- x 
cols <- ifelse([email protected]$MajClass == 1, "red",
          ifelse([email protected]$MajClass == 2, "green",
            ifelse([email protected]$MajClass == 3, "blue", NA)))
  plot(r, legend=FALSE)
    plot(polys, col=cols, add=TRUE)
      legend("bottomright", legend=c(unique(x)), fill=cols)

I would add, another efficient way to calculate frequencies is the table function. Let's create a vector of 1,2,3, the table function will then return counts of each value in the vector.

x <- c(1,1,1,2,2,3,3,3,3,3)

We can then return the class name associated with the most frequent value.


In the context of the problem at hand, we can use lapply function to the extracted raster values.

( ep <- extract(r, polys) )
  ep <- lapply(sp, function(x) x[!])
    ( x <- unlist(lapply(ep, FUN=function(x) { as.numeric(names(which.max(table(x)))) })) ) 

An exact frequency "count" approach will give a more precise answer. Let's take an example where extreme values are existing in the same polygon and compare mode and count.


@dof1985 2014-11-07 19:59:14

Update: I have thought of a simpler way. Yet it has its own risks ivolving rasterizing your polygons. The original suggestion is below this one.

Start with rasterizing your parcel polygon layer. Than use r.statsitics from the GRASS plugin in QGIS (documentation) to compute the mode of the crops raster in each one of the parcels raster (each of them must have a unique data within it, base on ID of the vector layer). After that you might either vectorize the result again, this time each raster has the attribute of its dominant (mode) crop, or you can use zonal statistics with the resulting (statistics) raster and your original polygon. For that second option, it would be easier to conclude from the mean, min and max what is the dominant crop within each polygon (if the rasterized polygons and the vector polygins will be perfectlly overlayed, than all statistics will be equal to the crop code).

This is a bit of a work around soltuion, since ArcGIS is out of hand for you. What you are looking acctually is the mode of the ratster attributes in each one of your polygons, e.g. land parcles.

You might use model builder or a script to loop the following steps:

  1. For each crop type, between 1 to n; extract a new binary raster using the raster calculator, e.g. for crop number 1 (wheat for example) use th syntax [email protected]=1.
  2. Use Zonal statistics with each binary raster crop_1 to crop_n use zonal statistics to calculate the count of the crop within each polygon. You might prefer to call the column "crop1_cnt","crop2_cnt", etc.
  3. Last, use the field calculator with a CASE...ELSE expressions to feed the dominant (or mode) crop for each polygon; literally for each row find the highest count of all crop counts and write the corresponding crop code to a new dominant_crop column.

It might be complex, but I couldn't find a tool that calculate the "mode" for zonal statistics in grass or qgis.

@dof1985 2014-11-07 20:09:01

A R soulution might be found here:…; yet, a mode function should be written.

@Jeffrey Evans 2014-11-07 20:15:49

The mode is the peak value of a distribution regardless of it representing the central tendency of the distribution and makes very little sense in this regard. I believe that the question is, in fact, wanting the majority class associated with crop types in each polygon.

@dof1985 2014-11-07 20:18:46

@JeffreyEvans you are right, however taking into account the crop categories are infact a nominal variable, the mode (or the most frequent category) within a polygon (i.e. parcel) will also represent the majority.

@Steven 2014-11-07 20:36:29

@JeffreyEvans, in the raster image, crop type is represented by a numeric value between 1 and 255. The mean is useless as discussed above, the median would be equally worthless, but the mode would represent which crop type is most frequent in the distribution of crop types in a given parcel. You're right that the question is to determine the majority class within each polygon. The mathematical solution is simply the mode. However,the answer provided by dof1985 presents another solution that I'm about to try.

@Steven 2014-11-07 20:37:51

@dof1985, I'm working on that R solution as well. The function for 'Mode' is the easy part. For some reason, my raster is being read into R with no data. Identifying why that's the case is the difficult part. Thanks for the answer you've provided; I'm going to try that now.

@dof1985 2014-11-07 20:38:00

@Steven. Hope one of those will work for you. Pleas update with a solution, I'm curios...Good luck

@Jeffrey Evans 2014-11-07 20:48:43

So, you have 255 crop types? This seems to be representing something other than "type". What exactly happens if a given polygon exhibits a bimodal distribution?

@Steven 2014-11-07 23:15:28

I'll update with a solution after I've tried those presented here. I don't think it'll be by this evening, but I'll keep everyone updated. @JeffreyEvans: Thanks for the R solution. I'm more comfortable in R than GIS. And yes, the raster data downloaded from CropScape (USDA) provides 255 different values for crops (rather, specific commodities) though upon closer examination, 123 of those 255 are 'blank.' If a bimodal distribution, I'm in a tough spot, I guess. My client knows that the data from the USDA is messy and we may have to adjust accordingly.

@Jeffrey Evans 2014-11-07 23:19:04

The "dmode" function that I provided in my example will not have an issue with a bimodal distributions, it will just pick the mode with the higher peak value.

@Steven 2014-11-07 23:22:02

Interesting fact: @JeffreyEvans, you and I have played Irish music together in Laramie! I'm a beginner guitar player and sat in on some sessions with you on Front Street! This, however, is unrelated to the topic at hand!

@dof1985 2014-11-08 17:17:52

@Steven, the raster you downloaded has values between 0 - 255, since it had an 8-bit radiometric resolution, namely gray levels (or intensity of refelctance caught by satellite sensor) can be represented by a number between 0 (low intensity) to 255 (high intensity). However, as you have already observed, it doesn't necessarly mean that it uses all the radiometric range. In addition, you might prefer to use categorial symbology on a min-max strech for crop types data.

Related Questions

Sponsored Content

1 Answered Questions

Determine area of cell in raster (QGIS)

0 Answered Questions

Calculating the area of raster within polygon

1 Answered Questions

[SOLVED] How can I modify DEM elevation data within QGIS using a polygon?

1 Answered Questions

[SOLVED] How to plot a point on the highest elevation of a raster on a polygon. QGIS

0 Answered Questions

Microstation Elements created in QGIS

0 Answered Questions

Weighted Average of several raster "zones"

1 Answered Questions

[SOLVED] Avoid Intersections not honored when saving in QGIS?

1 Answered Questions

[SOLVED] Spatial query, embed in layer or project

Sponsored Content