Dealing With Vehicle Identification Numbers (VIN) #Data Quality

#DWH #BI #Automotive

The vehicle identification number (VIN) is the central identifier for vehicles in the automotive industry. When running an automotive project, you will have to deal with these 17 digit long numbers.

Example

A hypothetical example for a vehicle identification number would be

1M8GDM9AXKP042788

Notice the mix of characters and numbers, which is explained in the next section.

VIN Format

Every VIN is built using international standards, the ISO 3779 (Vehicle Identification Numbers) and ISO 3780 (World Manufacturer Identifier) standards.

For manufacturers who produce more than 500 vehicles per year, the VIN is divided into the following sections:

  • World Manufacturer Identifier (WMI)
  • Vehicle Descriptor Section (VDS)
  • Vehicle Identifier Section (VIS)
  • Check Digit

Wikipedia contributors. "Vehicle Identification Number." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 26 Jul. 2012. Web. 2 Aug. 2012.

World Manufacturer Identifier (WMI)

The WMI code identifies the manufacturer of the vehicle. The WMI code is using 3 characters. This ID has been assigned by the National Highway Traffic Safety Administration (NHTSA) to each manufacturer. Note that many manufacturers have multiple WMI codes.

For example, Volkswagen has the following WMI codes assigned by the NHTSA:

  • Volkswagen AG: WV0, WV1, WV2, WV3, WV4, WV5, WV6, WV7, WV8, WVA, WVB, WVC, WVD, WVE, WVF, WVG, WVH, WVJ, WVK, WVL, WVN, WVW
  • Volkswagen Argentina SA: 8AV, 8AW, 8AX, 8AY, 8AZ
  • Volkswagen Bruxelles SA: YBW
  • Volkswagen Caminhoes E Onibus Ind.Com: 953
  • Volkswagen de Mexico SA de CV: 3VV, 3VW
  • Volkswagen do Brasil Industria de: 9BW
  • Volkswagen do Brasil Ltda: 93U, 9BW, 9DW
  • Volkswagen Group of America Inc: 1V1, 1V2, 1V3, 1VW
  • Volkswagen Individual: W09
  • Volkswagen of South Africa Pty Ltd: AAA, AAV

Vehicle Descriptor Section (VDS)

Positions 4 to 8 of the VIN are used for the Vehicle Descriptor Section (VDS). The VDS is used to identify the vehicle type, and may include information on the automobile platform used, the model, and the body style. The VDS is used by the manufacturers according to their own standards. However, most manufacturers are using the 8th digit to identify the engine type of the model.

Vehicle Identifier Section (VIS)

Position 10 to 17 are used for the Vehicle Identifier Section (VIS). This section identifies the individual vehicle. It may include information on options installed or engine and transmission choices, but often is a simple sequential number.

Check Digit

The character on position 9 is used as a check digit for VINs.  This check digit is covered in one of the subsequent sections.

Data Quality

As with all data from operating systems, the data quality of these numbers is mixed. If all data would come from automated systems, the quality would be higher, but oftentimes, users are inputting these 17 characters by hand. This manual work results in a variety of errors, for example

  • VINs of incorrect length (too short or too long)
  • VINs of incorrect format
  • Invalid VINs
  • etc.

Before loading the data into a data warehouse, it is important to check the VINs and identify incorrect / invalid records. How you deal with incorrect data depends on your requirements and is out of the scope of this article. However, feel free to ask questions at the end of the article.

The next sections present some common checks to VINs which are used regularly.

Check 1: String Length

The first, yet most simple check is to validate the length of the VIN string. The defined length of a VIN is 17 characters. A VIN shorter or longer is invalid.

This check is the quickest and should be performed first.

Check 2: VIN Format using Regular Expression

While there are competing standards used in various regions, all vehicle identification numbers follow a given format. This format can be checked using the following regular expression (which has been found on the Regular Expression Library):

^(([a-h,A-H,j-n,J-N,p-z,P-Z,0-9]{9})([a-h,A-H,j-n,J-N,p,P,r-t,R-T,v-z,V-Z,0-9])([a-h,A-H,j-n,J-N,p-z,P-Z,0-9])(d{6}))$

To increase the validity of the regular expression, you could create separate regular expressions which are used depending on the region of the world manufacturer identifier (next section). However, in this example, we will stick to the general regular expression provided above.

Check 3: World Manufacturer Identifier

Another valid check covers the world manufacturer identifier (WMI) at the beginning of the VIN. We can use a Microsoft Access database with all assigned WMI codes, provided by the NTHSA to check the first three characters of the VIN against the assigned WMI codes.

Check 4: VIN Check Digit

The most sophisticated check can be performed using the VIN check digit which is part of the VIN. The check digit is the character on position 9 and can be a number from 0 to 9 and X (for 10). However, check digits are not compulsory for all VINs. Only North America and selected countries or manufacturers are using the check digit. For example, Citroën, BMW, Renault, Audi, Korean Chevrolets, Fiat and European Fords are not using the check digit, so a check for this number is not possible. The same applies to Volkswagen commercial cars.

Instead of the check digit, Volkswagen commercial cars and other manufacturers use a “invalid” check digit, e.g. the character Z (in the case of Volkswagen commercial vehicles) or A. When checking the check digit, we could first check if the check digit is a valid character (0-9, X). If this is the case, perform the check, otherwise skip it. This can be done using the following regular expression:

 w{8}[^0-9X]w{8}

If the expression matches the VIN, it contains an invalid check digit and should be ignored.

Another problem is that some manufacturers have assigned invalid VINs to their cars, i.e., VINs with invalid check digits. One documented example is Lamborghini. Therefore, a wrong check digit does not mean that the VIN is not assigned to a car. Any check must take these wrong series into consideration as well. One example is to have a list of regular expressions ready which identify invalid check codes. In the example of Lamborghini, the regulare expression

ZHWw{5}w1w{7}

could be used to identify Lamborghini’s with model year 2001 which might have an invalid check digit.

Given all the problems, one question remains: how to calculate the check digit and verify it for a given vehicle identification number? This is covered in the next section.

Check Digit Verification

Before we verify the check digit, we have to remove it from the VIN to be checked. There are two options to achieve this:

  1. remove the check digit for the purpose of calculation
  2. utilize the multiplicate property of zero in the weight to cancel it out.

We will use the second method in our calculations below.

First, the alphabetic characters have to be translated to numeric equivalents. We use the EBCDIC method for creating the following chart:

The characters I, O, and Q are not allowed in a Vehicle Identification Number and have been left out from the chart. Numerical digits in the VIN are using their respective own values.

Each position in the VIN is being weighted. We use the following chart to determine the weight of each position:

The weight of the check digit on position 9 has been set to 0 (as described before).

Each character (the numeric equivalent) of the VIN is multiplied with the weight of its position. All these products are then summed up and the result is divided by 11. The remainder (see the MOD function) leads to the check digit. If the remainder is 10, the check digit becomes X, otherwise the check digit is just the numeric reminder (therefore, a number between 0 and 9).

If the check digit from our calculation does not match the original check digit (the one from the VIN to be verified), the VIN is invalid. However, a match does not prove the VIN is correct because there is still a 1 in 11 chance of any two distinct VINs having a matching check digit: an example of this would be the valid VINs 5GZCZ43D13S812715 (correct with leading five) and SGZCZ43D13S812715 (incorrect with leading character ‘S’).

3 thoughts on “Dealing With Vehicle Identification Numbers (VIN) #Data Quality

  1. One more idea: by law, the check digit must be valid for specific regions, such as North America. The NHTSA provides a database with World Manufacturer Identifiers (WMIs) that includes the country of each manufacturer.

    We could use this information to improve the checker and enforce the use of the check digit for those regions.

  2. Pingback: Vehicle Identification Number (VIN) Validator for #SQLServer | Michael Olschimke

Leave a Reply

Your email address will not be published. Required fields are marked *