The use of weather data in data science is hugely varied, with applications in every applied vertical. At the heart of all of these projects, however, are accurate, trustworthy weather datasets. In this article, we’ll explain the types of weather data available for data science, how to get them, how to make sure they are trustworthy and finally how to incorporate the data into your project so that you can gain the maximum benefit and insight.
Many data science projects involve sourcing raw data to complete the project. Whether the task is data visualization, data analysis, machine learning or another data science activity, all projects start with at least one set of data. If your task involves analyzing or visualizing the weather itself, a source of weather data alone may be enough. If your project is to analyze how another set of data is influenced by, or correlates with, weather and climate patterns, then you’ll need a dataset that supports accurately joining the data together. If you also want to make predictions, then you’ll need a source of weather forecast data.
Types of Weather Data – Weather Forecast & Historical Weather
Before we dive into the details of how to obtain and use weather data, it’s worth reviewing the types of weather data you will find and how they are typically used.
Historical weather data
Around the world, tens of thousands of weather stations are continuously monitoring the weather. These individual weather stations provide weather history observations that can be used as the input to a weather forecast or as a record of the weather for us to analyze in our own work. The weather stations report at regular intervals and include weather elements such as temperature, precipitation (rain, snow etc.), wind, pressure and possibly many more weather variables.
Historical weather summaries and climate data
Historical and climate summaries are simply aggregations of the raw historical weather observation data that provide a picture of what the typical weather for a location has been. For example, we can take many years of historical weather data to calculate that the average temperature in Paris, France for January is 8C/46F. On any given January day, the temperature may well be much colder or warmer than 8C, but the long-term climate average is 8C. Such historical weather summaries can be calculated across years, months, weeks and even days to help create a ‘typical’ weather picture for a location. They can also be used to give a picture of the possible weather extremes by identifying the highest and lowest values of any particular weather metric. Continuing the Paris example, the highest temperature is 16C/61F and the lowest maximum temperature recorded is a chilly -6C/21F.
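As a minimal sketch of how such a summary could be computed, the example below groups daily maximum temperatures by calendar month across several years and reports the mean, highest and lowest value for each month. The records are made-up illustrative values, not real Paris observations, and the function name is our own.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily records: (ISO date, daily maximum temperature in C).
daily_max = [
    ("2019-01-05", 7.0),
    ("2019-01-20", 10.0),
    ("2020-01-05", -1.0),
    ("2020-01-20", 12.0),
    ("2020-07-14", 27.0),
]

def monthly_climate_summary(records):
    """Group daily values by calendar month across all years and summarize."""
    by_month = defaultdict(list)
    for iso_date, value in records:
        month = int(iso_date[5:7])  # characters 5-6 of YYYY-MM-DD hold the month
        by_month[month].append(value)
    return {
        month: {
            "mean": mean(values),
            "highest": max(values),
            "lowest": min(values),
        }
        for month, values in by_month.items()
    }

summary = monthly_climate_summary(daily_max)
print(summary[1])  # January summary across both years
```

The same grouping key could be a week number or a day of the year if a finer-grained ‘typical’ picture is needed.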
Weather forecast data
The last type of weather data available is the weather forecast. This provides a detailed prediction of how the weather will behave over the coming days. Weather forecasts typically range from three to 15 days out, with the first seven days being the most accurate. Multiple organizations create weather forecasts, each with its own strengths and weaknesses. The most accurate weather forecast data will combine the output from multiple forecast models to provide the best estimate forecast.
Ensuring the Weather Data is suitable for Data Science
Weather data for data science needs to achieve a number of goals. Firstly, the data must of course be accurate and complete. In addition to being trustworthy, the data must be in a format that can be used in our chosen data science tool, be it R, Excel, a database or a custom Python script. Let’s first look at making sure the weather data is complete and accurate. For weather data, “completeness” is required geographically (there is data near the locations I’m investigating) and temporally (there is accurate data for the date and time that I’m interested in).
Spatial and temporal resolution
One of the most important features of weather data is that it includes enough spatial and temporal resolution. What does this mean? For historical data, it means finding a source of weather data that is close enough to the point of interest to be considered accurate. If the weather station is too far away from the target location, it is likely that the weather data will not be correct for the location we are analyzing.
Unfortunately, there are a limited number of weather stations, so be careful about loading weather history records by simply using the ‘closest’ weather station. It’s better to look at all nearby weather stations and combine their results into the best estimate for the location of interest. For example, if I am interested in the weather data for Fairfax, Virginia, there are multiple weather stations nearby: Washington Dulles Airport, Washington National Airport and several others. They are all close, so which should I trust? The answer is to trust all of the stations! By combining them, we check that we are using the most representative weather values for the location. This technique helps eliminate localized effects such as Washington National Airport tending to report warmer temperatures because of its proximity to the warm Potomac River.
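One simple way to combine nearby stations is inverse distance weighting, where closer stations count for more. This is only one possible technique and not necessarily what any particular provider uses. The sketch below assumes approximate coordinates for Fairfax and the two airports, a hypothetical third station, and made-up temperature readings.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def idw_estimate(target, stations, power=2):
    """Inverse-distance-weighted average of station values at the target point.

    target   -- (lat, lon) of the location of interest
    stations -- iterable of (lat, lon, value) tuples
    """
    target_lat, target_lon = target
    weighted_sum = 0.0
    weight_total = 0.0
    for lat, lon, value in stations:
        distance = haversine_km(target_lat, target_lon, lat, lon)
        if distance < 0.001:
            return value  # station sits on the target, so use it directly
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

fairfax = (38.85, -77.31)  # approximate coordinates
stations = [
    (38.95, -77.46, 21.0),  # roughly Washington Dulles Airport, made-up reading
    (38.85, -77.04, 23.5),  # roughly Washington National Airport, made-up reading
    (38.72, -77.18, 22.0),  # hypothetical third station
]
estimate = idw_estimate(fairfax, stations)
print(round(estimate, 1))
```

Because the estimate is a weighted average, it always lies between the lowest and highest station readings, which dampens the effect of any single station with a local bias.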
We must also check that the data has enough temporal detail to be analyzed effectively. Weather data that will be compared to another data set, such as business performance data, typically needs to be available at least hourly so that any time-of-day analysis can be performed accurately. It is not possible to analyze hourly business data if the weather data is only reported as a daily summary: rain at night often has no bearing on activities during the daytime.
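To illustrate why hourly data matters, here is a small sketch that joins timestamped business events to the weather record for the same hour. The field names (temp_c, precip_mm) and all values are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical hourly weather keyed by the start of each hour.
hourly_weather = {
    datetime(2023, 6, 1, 14): {"temp_c": 24.0, "precip_mm": 0.0},
    datetime(2023, 6, 1, 15): {"temp_c": 22.5, "precip_mm": 3.2},
}

# Hypothetical business events: (timestamp, sale amount).
sales = [
    (datetime(2023, 6, 1, 14, 12), 40.0),
    (datetime(2023, 6, 1, 15, 47), 18.5),
    (datetime(2023, 6, 1, 16, 5), 22.0),
]

def join_to_hourly_weather(events, weather):
    """Attach the weather record for the hour in which each event occurred."""
    joined = []
    for timestamp, amount in events:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        joined.append({
            "time": timestamp,
            "amount": amount,
            "weather": weather.get(hour),  # None when no record exists
        })
    return joined

joined = join_to_hourly_weather(sales, hourly_weather)
for row in joined:
    print(row["time"], row["amount"], row["weather"])
```

The third sale has no matching weather record, which is exactly the kind of gap the completeness checks in the next section are meant to catch.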
Error checking and observation completeness
Another problem with historical data is that weather stations occasionally fail to report the weather due to equipment failure, planned outages or other causes. Using multiple stations helps mitigate this problem because we can fall back to alternative weather stations when a station did not report certain hourly records.
The final part of historical data validation is to identify errors in the weather station observations. Unfortunately, some weather observations include errors in addition to omissions. When using weather observations within data science applications, it’s important to understand whether the data has been analyzed and cleansed and, if so, what procedures were followed.
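The fallback idea can be sketched as follows: for each hour, take the value from the first station in a priority order that actually reported. The station data and the function name are hypothetical.

```python
def fill_from_fallbacks(stations_by_priority):
    """Merge hourly series, preferring earlier stations in the list.

    stations_by_priority -- list of dicts mapping hour -> value, where a
    missing key or a value of None means the station did not report.
    """
    hours = sorted(set().union(*(s.keys() for s in stations_by_priority)))
    merged = {}
    for hour in hours:
        merged[hour] = next(
            (s[hour] for s in stations_by_priority if s.get(hour) is not None),
            None,  # no station reported for this hour
        )
    return merged

primary = {0: 10.0, 1: None, 2: 12.0}   # hour 1 missing, hour 3 absent
secondary = {0: 9.5, 1: 11.0, 3: 13.0}

merged = fill_from_fallbacks([primary, secondary])
print(merged)
```

In practice the priority order might be based on distance from the location of interest, so that a gap is filled by the next-closest station.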
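As a rough illustration of what basic cleansing might involve, the sketch below flags hourly temperatures that are missing, fall outside a plausible range, or jump implausibly from the previous valid reading. The thresholds are assumptions chosen for the example, not an established quality-control standard.

```python
def flag_suspect_temperatures(temps_c, low=-90.0, high=60.0, max_hourly_jump=15.0):
    """Return one flag per hourly reading: ok, missing, out_of_range or spike."""
    flags = []
    previous = None  # last reading that passed every check
    for temp in temps_c:
        if temp is None:
            flags.append("missing")
            continue
        if not low <= temp <= high:
            flags.append("out_of_range")
            continue
        if previous is not None and abs(temp - previous) > max_hourly_jump:
            flags.append("spike")
            continue
        flags.append("ok")
        previous = temp
    return flags

readings = [12.0, 12.5, 99.9, 13.0, None, 35.0, 13.5]  # made-up values
flags = flag_suspect_temperatures(readings)
print(flags)
```

Flagged readings can then be dropped, or replaced from a nearby station, depending on how the provider or the project chooses to handle them.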
Technologies available to import the weather data
Once we have identified a reliable source of weather data, we can now prepare to obtain the data so we can integrate it within our data science application. There are a number of typical ways to retrieve weather data.
Downloadable data files
Some data providers will supply weather data in a flat file format. This may be a standard format such as Comma Separated Values (CSV), in which case the data can often be used or imported directly into a data science project. Other data providers will supply raw historical data in a format that is not so easily read by normal data science tools. This is typical for raw data from government meteorological departments because weather observation data can be very large. Earth scientists have developed data formats such as GRIB and NetCDF to allow such large data volumes to be processed.
Raw historical weather data sets such as these will generally require pre-processing before they can be used in a data science project. In addition, most will not include full error checking or the ability to interpolate the observations from multiple nearby weather stations.
Commercial weather data providers (some of which include a free trial) will often perform additional data processing and formatting that makes the data easier to consume in a data science project. Data formats can include CSV, JSON or other plain text file formats.
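For plain text formats, very little code is needed to load the data. The sketch below parses a small CSV sample and a JSON sample into the same list-of-dictionaries shape using only the Python standard library. The column names are assumptions; a real provider's files will have their own layout.

```python
import csv
import io
import json

# Hypothetical file contents; real files would be read from disk instead.
csv_text = (
    "datetime,temp_c,precip_mm\n"
    "2023-06-01T14:00,24.0,0.0\n"
    "2023-06-01T15:00,22.5,3.2\n"
)
json_text = '[{"datetime": "2023-06-01T14:00", "temp_c": 24.0, "precip_mm": 0.0}]'

def parse_weather_csv(text):
    """Parse CSV text into dictionaries with numeric weather values."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "datetime": row["datetime"],
            "temp_c": float(row["temp_c"]),       # CSV values arrive as strings
            "precip_mm": float(row["precip_mm"]),
        })
    return rows

def parse_weather_json(text):
    """Parse JSON text; numeric types are already preserved by the format."""
    return json.loads(text)

rows_csv = parse_weather_csv(csv_text)
rows_json = parse_weather_json(json_text)
print(rows_csv[0] == rows_json[0])
```

Formats such as GRIB and NetCDF need specialist libraries and are not covered by this sketch.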
Weather API for automated data retrieval and loading
Downloading weather data and then importing it into a database or other application is generally suitable for a small number of data loads, for example importing a fixed amount of historical weather data for analysis with a fixed set of dates. Unfortunately, it is often too restrictive to deal only with a fixed set of weather data. Weather forecasts, for example, generally need to be refreshed at least daily. If the application uses recent weather history, that too may need a regular refresh, often at least every hour so that the dashboard, visualization or other output can display the very latest information.
Such a frequent data refresh requires an automated procedure to retrieve the weather data and then load it into the application. One of the best ways to do this is via a web service API. Web services use the same technologies as the general World Wide Web (the HTTP and HTTPS network protocols) to transfer the weather data from the provider to the client.
Many applications that are used for data science can import data directly from such web services. For example, Microsoft Excel, Power BI and many business intelligence applications such as MicroStrategy and Tableau are able to read information from a web service directly. In this case, it’s often useful to have a weather data provider that can supply the weather data in a standard data format such as CSV so that the application can easily import the data.
OData (short for Open Data Protocol) is a standard form of RESTful web service that some applications such as Excel and SAP Analytics Cloud support. OData acts just like a web service except that the exchange format is very formalized so that data science clients know how to consume the data without any modifications. If your application and data provider both support OData, then this will provide an easy path to importing the weather data into your application.
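A minimal sketch of such an automated retrieval is shown below. The base URL, the query parameter names and the API key are all hypothetical placeholders; every real weather API defines its own endpoint and parameters, so consult your provider's documentation.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, used only to show the shape of a request.
BASE_URL = "https://weather.example.com/api/history"

def build_url(location, start, end, api_key):
    """Build a request URL with safely encoded query parameters."""
    query = urlencode({
        "location": location,
        "start": start,
        "end": end,
        "format": "csv",   # assumed parameter asking for CSV output
        "key": api_key,
    })
    return f"{BASE_URL}?{query}"

def fetch_rows(url, timeout=30):
    """Download CSV text over HTTPS and parse it into dictionaries."""
    with urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))

url = build_url("Fairfax, VA", "2023-06-01", "2023-06-02", "YOUR_API_KEY")
print(url)
# rows = fetch_rows(url)  # would perform the actual network request
```

A scheduler such as cron could run a script like this hourly or daily to keep the application's data fresh.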
Accurately analyzing weather data
We have now found and imported our weather data into the application so we can perform our data science. Now let’s look at the typical data we can expect within a weather data report.
A typical weather dataset will include multiple columns such as temperature, precipitation and wind speed. If the dataset uses a short time period for each record, such as an hourly weather forecast or hourly historical observations, then little post-processing will have been applied to the imported data. However, if the data is aggregated to summarize a day, month or year, then multiple weather observations are combined into a single record. There are different ways for this to happen, and it’s important to understand how a particular aggregated weather variable value is obtained.
Temperature is typically aggregated in three ways: the maximum temperature, the minimum temperature and the arithmetic mean of the temperature. The mean temperature can be the mean of all hourly values or the average of the maximum and minimum values. In a typical day, the maximum temperature occurs in the afternoon, and therefore simply reporting the maximum temperature is a good substitute for the overall temperature of the day. In some circumstances, the maximum temperature does not occur in the afternoon, such as when a colder or warmer air mass is moving through a location. In these cases, using the maximum daily temperature to compare against business performance may not produce accurate analysis and results. It is generally better to analyze temperature at the hourly level.
For some applications it is necessary to understand more about the maximum and minimum temperature when investigating the typical weather for a time and location. For example, consider the normal temperature for a location in January. We would like to understand the normal maximum temperature (mean maximum temperature) and also the possible variability. What temperature range do 80% of the days fall between? What is the highest or lowest daily maximum temperature recorded at this location? The maximum temperature is a good guide to the typical weather at a location, particularly when additional statistical values are considered so that both the typical temperature and its variability are understood.
Rainfall is typically summed over the aggregation period. The precipitation coverage, the amount of time during which rain fell, is often as important a driver for business metrics as the amount of rain that falls. A short, sharp but heavy thunderstorm at the end of the day in Miami, Florida may well have less impact on tourist activities than light rain that lasts all day. However, the former may well produce significantly more rainfall and therefore look worse in the daily weather observation data.
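The difference between these aggregations is easy to see with a small example. The sketch below computes the daily maximum, minimum and both kinds of mean from 24 made-up hourly readings for a day on which a cold air mass moves through, so the maximum occurs at midnight instead of in the afternoon.

```python
from statistics import mean

def aggregate_daily_temperature(hourly_temps_c):
    """Summarize one day of hourly temperatures (index 0 = midnight)."""
    t_max = max(hourly_temps_c)
    t_min = min(hourly_temps_c)
    return {
        "max": t_max,
        "min": t_min,
        "mean_hourly": mean(hourly_temps_c),      # mean of all hourly values
        "mean_of_extremes": (t_max + t_min) / 2,  # average of max and min
        "hour_of_max": hourly_temps_c.index(t_max),
    }

# Made-up readings for a day with a cold front: temperature falls all day.
front_day = [
    16.0, 16.0, 15.0, 15.0, 14.0, 14.0, 13.0, 12.0, 10.0, 9.0, 8.0, 8.0,
    7.0, 7.0, 7.0, 6.0, 6.0, 5.0, 5.0, 4.0, 4.0, 3.0, 3.0, 2.0,
]
daily = aggregate_daily_temperature(front_day)
print(daily)
```

Here the daily maximum of 16C says little about conditions during business hours, when the temperature was between 9C and 5C, which is why hourly analysis is usually the safer choice.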
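These questions map onto a handful of summary statistics. As a sketch, the example below takes a set of made-up January daily maximum temperatures and reports the mean maximum, the range containing roughly the middle 80% of days (10th to 90th percentile), and the highest and lowest daily maximum. The choice of the "inclusive" quantile method is ours; other interpolation methods give slightly different cut points.

```python
from statistics import mean, quantiles

def summarize_daily_maxima(daily_max_c):
    """Typical value and variability of daily maximum temperatures."""
    # n=10 yields nine cut points: the 10th, 20th, ... 90th percentiles.
    deciles = quantiles(daily_max_c, n=10, method="inclusive")
    return {
        "mean_max": mean(daily_max_c),
        "p10": deciles[0],
        "p90": deciles[-1],
        "highest_max": max(daily_max_c),
        "lowest_max": min(daily_max_c),
    }

# Made-up January daily maximum temperatures in C.
january_maxima = [-2.0, 3.0, 5.0, 6.0, 7.0, 8.0, 8.0, 9.0, 10.0, 11.0, 12.0, 14.0]
stats = summarize_daily_maxima(january_maxima)
print(stats)
```

With many years of data, the same function gives a fuller picture of what "normal" means for the location than the mean maximum alone.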
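The contrast between total rainfall and coverage can be sketched with two made-up days of hourly precipitation. In this example coverage is expressed as the fraction of hours with any measurable rain, which is one reasonable definition among several.

```python
def summarize_precipitation(hourly_precip_mm):
    """Return total rainfall and the fraction of hours in which rain fell."""
    total = sum(hourly_precip_mm)
    wet_hours = sum(1 for amount in hourly_precip_mm if amount > 0)
    return {
        "total_mm": total,
        "wet_hours": wet_hours,
        "coverage": wet_hours / len(hourly_precip_mm),
    }

# Made-up days: one heavy evening thunderstorm versus light rain all day.
thunderstorm_day = [0.0] * 23 + [30.0]
drizzle_day = [0.5] * 24

storm = summarize_precipitation(thunderstorm_day)
drizzle = summarize_precipitation(drizzle_day)
print(storm)
print(drizzle)
```

The thunderstorm day has the larger total but rain in only one hour, while the drizzle day has a smaller total spread across every hour, so the two metrics rank the days in opposite order.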
Sources of Weather Data
Get started with our free trial and free data tier at Weather Data Services.