The importance of understanding the weather data we use

The challenges of managing data, using something as simple as a rain forecast as an example.

Introduction

A wide range of weather data is now freely available online, and it can be accessed with a few lines of code through simple-to-use APIs. This opens up a huge area of data analytics, and allows for a wider understanding of the impacts of weather conditions in a whole host of new areas. This in turn leads to easier, better-informed decision making, both for running and maintaining infrastructure and for individual choices on a daily basis. With the click of a button we can see the weather conditions where we are, but understanding that data, and how it is sourced, is very important.

Why?

Take, for example, the percentage chance of rain shown on any given weather app. In the image below we see that Tuesday has a 40% chance of rain, but what does this actually mean? Typically the ‘probability of precipitation’ is the confidence that it will rain somewhere in the forecast area, multiplied by the fraction of that area expected to receive rain.

The 40% chance of rainfall on Tuesday could mean there is a 40% chance of rainfall over the whole area measured, but typically rainfall isn’t that evenly distributed in space. There could equally be an 80% chance of rainfall in 50% of the area, and the probability of rainfall would still be 40% (0.8 × 0.5 = 0.4). This isn’t that helpful for either the dry or the wet half of the area! In most cases, this just means you might be caught in a shower without a waterproof, but if, say, you are a farmer leaving your crops out to dry, it could have very costly consequences. It is therefore very important to understand the resolution of the weather data.
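The arithmetic above can be made concrete with a minimal sketch. It assumes the simple "confidence multiplied by area fraction" definition described here; the function name and its parameters are illustrative only and do not come from any weather service's API.

```python
def probability_of_precipitation(confidence: float, area_fraction: float) -> float:
    """Return the probability of precipitation as confidence x area fraction.

    Both inputs are fractions between 0 and 1 (an assumption of this sketch).
    """
    for name, value in (("confidence", confidence), ("area_fraction", area_fraction)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return confidence * area_fraction


# Two very different situations that an app would both display as "40%".
uniform = probability_of_precipitation(confidence=0.4, area_fraction=1.0)
patchy = probability_of_precipitation(confidence=0.8, area_fraction=0.5)
print(f"uniform: {uniform:.0%}, patchy: {patchy:.0%}")
```

Both calls give the same headline number, which is why the single percentage alone cannot tell you which situation you are in.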

Example: rainfall

Each type of weather data (e.g. wind, rainfall, snowfall, temperature) has its own factors to consider, but let us take rainfall as an example. Rainfall is typically provided as gridded data, which combines different data sources, including weather radar, rain gauges and satellite data.

  • Weather radars provide volume-averaged estimates, but the size of each sampled volume changes with distance from the radar. The end rainfall product provided by weather services is usually gridded to 1 km^2 pixels for practical use. Despite this, because of the way weather radars work, a pixel close to the radar will be an average of several observations, each with a small volume, whereas a pixel about 100 km from the radar will be a single observation with a large volume. Weather radars are also subject to many errors, including ground clutter (e.g. where the radar beam hits buildings or mountains) and overshooting (i.e. when rainfall is missed because the radar beam passes above it).
  • Rain gauges measure rainfall at ground level, using a small ‘funnel’ type dish with a size of less than 10 cm^2. This data may also be gridded, using interpolation methods which estimate rainfall in areas where there are no measurements. These observations are typically less prone to errors than weather radar, but the network of rain gauges is very sparsely distributed across the country, and so it typically misses the spatial distribution of rainfall. Gauges can also be affected by wind, sited in unsuitable places (e.g. under a tree), blocked by debris (in rural areas) or vandalised (in urban areas), resulting in missing data.
  • It is important to consider where the data has come from - it may be gridded to 1 km^2, but it may be based on only a handful of working stations, and the nearest one could be very far away. This leads to high uncertainty in the estimates, particularly in rural areas, which are generally where the information can be crucial.
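To illustrate how a gridded value can rest on distant gauges, here is a minimal sketch of one common interpolation method, inverse distance weighting. It is not necessarily the method any particular weather service uses. The gauge positions and readings are made up, coordinates are assumed to be planar and in kilometres, and the function name is hypothetical. Alongside the estimate it returns the distance to the nearest gauge, as a rough indicator of how much to trust the value.

```python
import math


def idw_estimate(target_xy, gauges, power=2.0):
    """Estimate rainfall at target_xy by inverse distance weighting.

    gauges is a list of (x_km, y_km, rain_mm) tuples (assumed planar coordinates).
    Returns (estimated_rain_mm, distance_to_nearest_gauge_km).
    """
    if not gauges:
        raise ValueError("at least one gauge is required")
    weighted_sum = 0.0
    weight_total = 0.0
    nearest_km = math.inf
    for x_km, y_km, rain_mm in gauges:
        distance_km = math.hypot(x_km - target_xy[0], y_km - target_xy[1])
        if distance_km == 0.0:
            # The target sits exactly on a gauge, so use its reading directly.
            return rain_mm, 0.0
        nearest_km = min(nearest_km, distance_km)
        weight = 1.0 / distance_km ** power
        weighted_sum += weight * rain_mm
        weight_total += weight
    return weighted_sum / weight_total, nearest_km


# Made-up example: three working gauges, none of them near the pixel of interest.
gauges = [(0.0, 0.0, 1.2), (60.0, 10.0, 0.0), (25.0, 70.0, 4.8)]
estimate_mm, nearest_km = idw_estimate((30.0, 30.0), gauges)
print(f"estimate: {estimate_mm:.2f} mm, nearest gauge: {nearest_km:.1f} km away")
```

The pixel still receives a value, but every contributing gauge is tens of kilometres away, which is exactly the situation the gridded product hides.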

What questions should we be asking about our data?

The data is never going to be perfect, but primarily we want to answer the question “Is this data reliable enough for what we want to use it for?” The answer depends on how the data will be used. Before using it for modelling, it is good to ask a few simple questions about the data, making full use of any available documentation. If you are providing data, it is equally good to make sure its users are given the answers to these questions.

  1. Where is the data from? How is the data measured, and at what resolution?
  2. How has the data been processed? Is it raw data? Has any quality control taken place?
  3. Is the quality good enough for what it is being used for?
  4. Where might errors in the data arise? How can we mitigate them? Are there any preliminary checks we need to do before using it?
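As a minimal sketch of what such preliminary checks might look like, the snippet below summarises a rainfall series before it is used. Everything here is an assumption for illustration: the function name, the convention that missing readings are stored as None or NaN, and the 100 mm plausibility threshold, which should be chosen to suit the accumulation period and climate of the real data.

```python
import math


def preliminary_checks(rain_mm, max_plausible_mm=100.0):
    """Summarise basic quality indicators for a list of rainfall readings in mm.

    Missing readings are assumed to be stored as None or NaN.
    max_plausible_mm is an arbitrary illustrative threshold, not a standard.
    """
    if not rain_mm:
        raise ValueError("the series is empty")
    valid = [v for v in rain_mm if v is not None and not math.isnan(v)]
    n_missing = len(rain_mm) - len(valid)
    return {
        "n_values": len(rain_mm),
        "fraction_missing": n_missing / len(rain_mm),
        "n_negative": sum(1 for v in valid if v < 0),  # rainfall cannot be negative
        "n_above_max": sum(1 for v in valid if v > max_plausible_mm),
    }


# Made-up readings containing a gap, a NaN, a negative value and a suspicious spike.
readings = [0.0, 1.2, None, -0.5, 250.0, float("nan"), 0.4, 0.0]
report = preliminary_checks(readings)
print(report)
```

A report like this does not fix anything by itself, but it gives a quick first answer to questions 2 to 4 before any modelling starts.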

Summary

Weather data is extremely useful: it has changed the world we live in and allows for better preparation against extreme events. There is a wealth of information available in large data sets, and it is great that this is becoming more accessible in many areas of industry and practice. Even so, understanding the data we use is best practice, and leads to more robust and trustworthy results. With a few simple questions, we can not only understand the data better, but also draw better conclusions from it.