Lack of data validation - Source Code - Python

Need

Implementation of robust data validation mechanisms in the source code

Context

Usage of Python 3 for developing applications and scripts
Usage of the re library for regular expression matching and manipulation

Description

Non compliant code

import re

def validate_data(request):
    data = request.POST['data']
    # Dangerous regular expression
    pattern = re.compile('^(a+)+$')
    if pattern.match(data):
        return True
    else:
        return False

The above code is a simple function that takes a Django POST request and attempts to validate one of its fields with a regular expression. The regular expression used here is ^(a+)+$, which is dangerous because it exposes the application to a ReDoS (Regular Expression Denial of Service) attack.

This regular expression is dangerous because it uses nested quantifiers. The + quantifier means "one or more", so (a+)+ means "one or more of (one or more of 'a')". This can lead to excessive backtracking when trying to find a match.

For example, if an attacker sends a string of 'a's followed by a single 'b' (e.g., "aaaaaaaab"), the match is bound to fail, but before giving up the regular expression engine tries every way of splitting the run of 'a's between the inner and the outer quantifier. This is known as "backtracking". A run of n 'a's can be split in 2^(n-1) ways, so the work grows exponentially with the length of the input string, leading to a dramatic increase in CPU usage and potentially leaving the server unresponsive.
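
The growth described above can be observed with a minimal sketch like the following. The helper name time_failed_match and the input sizes are illustrative choices, not part of the original code, and absolute timings depend on the machine and the Python version.

```python
import re
import time

# The dangerous pattern from the non-compliant example
DANGEROUS = re.compile(r'^(a+)+$')

def time_failed_match(n):
    """Return the seconds spent rejecting n 'a' characters followed by one 'b'."""
    payload = 'a' * n + 'b'  # the trailing 'b' guarantees the match fails
    start = time.perf_counter()
    DANGEROUS.match(payload)
    return time.perf_counter() - start

if __name__ == '__main__':
    # Sizes are kept small on purpose; larger values can take very long
    for n in (10, 14, 18):
        print(f'{n:2d} chars: {time_failed_match(n):.4f} s')
```

With CPython's backtracking engine, each additional 'a' typically about doubles the time needed to reject the input, which is why even short payloads are enough for an attack.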

This is particularly dangerous in a web application context, as an attacker could simply send a large number of such requests to cause a denial of service.

Steps

Identify the dangerous regular expressions in your code. In this case, the regular expression '^(a+)+$' is considered dangerous because it can lead to catastrophic backtracking.
Replace the dangerous regular expressions with safer alternatives. Avoid nested quantifiers (a quantified group whose contents are themselves quantified, as in (a+)+) and alternations whose branches can match the same text, as both can lead to excessive backtracking.
Consider using a timeout for regular expression operations to prevent the server from hanging if a dangerous input is encountered.
Implement input validation and sanitization. Ensure that the input matches the expected format before processing it with a regular expression.
Use a static code analysis tool to automatically detect dangerous regular expressions in your code.
Regularly update your knowledge about regular expressions and their potential dangers. This can help you write safer code.
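
The input validation step can be sketched as follows, assuming the expected input is a short, non-empty string of 'a' characters. The names is_valid and MAX_LENGTH and the limit of 64 are hypothetical and should be adapted to the real format.

```python
import re

MAX_LENGTH = 64  # hypothetical limit; choose one that fits the expected input
SAFE_PATTERN = re.compile(r'a+')  # single quantifier, no nesting

def is_valid(data, max_length=MAX_LENGTH):
    """Return True only for a non-empty str of 'a' characters within the length cap."""
    # Cheap checks run first, so the regex never sees oversized or non-string input
    if not isinstance(data, str):
        return False
    if len(data) > max_length:
        return False
    return SAFE_PATTERN.fullmatch(data) is not None
```

Capping the length before matching bounds the work the regex engine can do, which limits the damage even if a risky pattern slips through review.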

Compliant code

import re

def validate_data(request):
    # Fall back to an empty string if the field is missing; the pattern rejects it
    data = request.POST.get('data', '')
    # Safer regular expression
    pattern = re.compile('^[a]+$')
    try:
        if pattern.fullmatch(data):
            return True
        else:
            return False
    except re.error:
        return False

The original code had a dangerous regular expression ^(a+)+$ which could lead to catastrophic backtracking. This is a situation where the regular expression engine tries every possible way of dividing the input among the nested quantifiers, which can cause the server to hang if the input is specifically crafted by an attacker.

The fixed code replaces the dangerous regular expression with a safer one ^[a]+$. This regular expression will match a string that contains only the character 'a', without causing excessive backtracking.

The fullmatch() method is used instead of match(). match() anchors only at the beginning of the string, and although the $ in the pattern anchors the end, $ also matches just before a trailing newline, so match() would accept "aaa\n". fullmatch() requires the entire string to match, which is more suitable for our validation purpose.
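
The difference between the two methods can be seen in a small sketch. The helper names accepted_by_match and accepted_by_fullmatch are illustrative only.

```python
import re

pattern = re.compile(r'^[a]+$')

def accepted_by_match(data):
    """True if match() accepts the input; $ also matches before a trailing newline."""
    return pattern.match(data) is not None

def accepted_by_fullmatch(data):
    """True only if the whole string, including any newline, fits the pattern."""
    return pattern.fullmatch(data) is not None

if __name__ == '__main__':
    for sample in ('aaa', 'aaa\n', 'aaab'):
        print(repr(sample), accepted_by_match(sample), accepted_by_fullmatch(sample))
```

Only the input with the trailing newline is treated differently: match() accepts it, fullmatch() rejects it.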

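A try/except block is added to catch re.error exceptions. Note that re.error is raised when a pattern is invalid and fails to compile, not when an input fails to match (a failed match simply returns None). With a constant, valid pattern the handler should never trigger, so it serves only as a defensive measure against unexpected crashes.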

The fixed code does not include a timeout for the regular expression operation, because the matching functions in Python's standard re module do not accept one. If you need a timeout, consider the third-party regex package, which is largely compatible with the re API and accepts a timeout argument on its matching functions.
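
A minimal sketch of a timeout, assuming the third-party regex package is installed (pip install regex). The function name is_only_a and the 0.5 second limit are hypothetical; when the package is missing, the sketch falls back to re, which is acceptable here only because the pattern has no nested quantifiers.

```python
import re

try:
    import regex  # third-party package, assumed to be installed separately
except ImportError:
    regex = None

def is_only_a(data, timeout_seconds=0.5):
    """Return True if data consists solely of one or more 'a' characters."""
    if regex is None:
        # Fallback without a timeout: safe only because 'a+' has no nested quantifiers
        return re.fullmatch(r'a+', data) is not None
    try:
        # regex raises the built-in TimeoutError when the limit is exceeded
        return regex.fullmatch(r'a+', data, timeout=timeout_seconds) is not None
    except TimeoutError:
        # Treat input that takes too long to evaluate as invalid
        return False
```

Rejecting input on timeout keeps the failure mode closed: a slow match is treated the same as a failed one.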

Remember to always validate and sanitize input data before processing it with a regular expression. This can help prevent many types of injection attacks.

References

187. Lack of data validation - Source Code