A regular expression (abbreviated to regex or regexp) is a string that describes a search pattern to look for in another string. Regular expressions are used in several scenarios such as data validation, searching, parsing, and scraping.
It looks like this:
/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
In this example we can see a regular expression that could be used to match a website URL. As you can see, a single line of code is enough to perform a complex validation.
A ReDoS (Regular expression Denial of Service) attack exploits certain "evil" patterns in a regex to increase its evaluation time until it overloads or hangs the application that uses it.
Example of attack
Consider a simple JavaScript snippet that uses the regex above to verify whether a string is a valid URL. If you want to try it out, you can copy-paste it into your browser's console.
var validUrlRegex = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
var url = 'https://google.com/'
console.time('regex evaluation time')
if (url.match(validUrlRegex)) {
console.log(url + ' is a valid url')
} else {
console.log(url + ' is not a valid url')
}
console.timeEnd('regex evaluation time')
The output should be:
https://google.com/ is a valid url
regex evaluation time: 2ms
But now consider applying the regular expression to a bad URL introduced into our code by an attacker.
var validUrlRegex = /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
var badUrl = 'https://google.com/aaaaaaaaaaaaaaaaaaaaaaaa@'
console.time('regex evaluation time')
if (badUrl.match(validUrlRegex)) {
console.log(badUrl + ' is a valid url')
} else {
console.log(badUrl + ' is not a valid url')
}
console.timeEnd('regex evaluation time')
Output:
https://google.com/aaaaaaaaaaaaaaaaaaaaaaaa@ is not a valid url
regex evaluation time: 2196ms
As you can observe, the evaluation time of the regex increased by about 1000 times just by adding 25 characters to the previous URL. Let's see why.
How a regex engine works
A regex engine is typically implemented as a nondeterministic finite automaton (NFA). In computer science, an NFA is a machine that processes each character of an input string (e.g. the URL strings above) and, for each of them, can proceed in several different ways during the evaluation.
When an engine starts to process a string, it tries to apply the regex pattern from the first character of the string. Since there can be several permutations (also called paths) that match the regex, the engine has to try them and return the first valid one. If it doesn't find a valid one, it tries to apply the regex pattern starting from the next character of the string, and so on. The engine continues until it either finds a match or reaches the end of the string.
Consider this regex
/(a+)+c/
and a string to test
aaaac
The regex matches a string with any number of "a" followed by "c", so our test string is a valid match. But, since we have used two + quantifiers, there are several paths that the engine can evaluate to match the regex (each group of letters represents a matching regex group):
aaaa c
aa aa c
aa a a c
a a a a c
and so on. In this case there are several paths that return a valid match, so the engine doesn't need to try all of them to find a right one. It will choose the first in the above list because the engine is greedy: it first evaluates groups with many characters, then groups with fewer.
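You can observe this greediness directly in JavaScript by inspecting the capturing group: the inner a+ consumes the whole run of "a" characters in a single iteration. A minimal sketch you can paste into a console:

```javascript
// The (a+)+ group is greedy: the inner a+ consumes the whole run of "a"
// characters in one iteration, so the captured group holds all of them.
var match = 'aaaac'.match(/(a+)+c/)
console.log(match[0]) // 'aaaac' (the whole match)
console.log(match[1]) // 'aaaa'  (the last iteration of the group)
```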
But now let's try to apply the same regex to the following non matching string:
aaaasc
As before, there are several available paths (in square brackets the character that makes the string not matching):
aaaa [s] c
aa aa [s] c
a a a a [s] c
and so on. When the engine tries a path and reaches the character that makes the match invalid, it checks the other paths until it either finds one that matches the regex or exhausts the available ones. Since no path is valid here, the engine needs to try all of them before finishing its evaluation.
This means that the time and computing power needed to evaluate a string can grow with the number of paths to try.
We can measure the performance of a regex for a given string by counting the number of steps. Each time the engine evaluates the string and changes the character from which the regex pattern is applied, the number of steps increases. The number of steps is directly proportional to the evaluation time.
Here is a table showing some benchmarks of the above regex applied to different strings.
| String            | Valid | Time   | Steps  |
| ----------------- | ----- | ------ | ------ |
| aaaaaaaaaac       | Y     | ~0ms   | 6      |
| aaaaaaaaaaaaaaaac | Y     | ~1ms   | 6      |
| aaaaaaaaasc       | N     | ~4ms   | 3057   |
| aaaaaaaaaaasc     | N     | ~19ms  | 12271  |
| aaaaaaaaaaaaasc   | N     | ~62ms  | 49133  |
| aaaaaaaaaaaaaaasc | N     | ~244ms | 196587 |
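You can reproduce a benchmark like this yourself. The sketch below (the string lengths and the use of Date.now() for timing are arbitrary choices) times the evil regex against non-matching strings of increasing length:

```javascript
// Time /(a+)+c/ against non-matching inputs of growing length.
// Each extra "a" roughly doubles the number of paths to backtrack through.
var evilRegex = /(a+)+c/
var durations = []

for (var n = 8; n <= 16; n += 2) {
  var input = 'a'.repeat(n) + 'sc'
  var start = Date.now()
  evilRegex.test(input)
  durations.push(Date.now() - start)
  console.log('length ' + n + ': ' + durations[durations.length - 1] + 'ms')
}
```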
How to defend your application from a ReDoS attack
As you can see, a bad regex and/or a bad string is enough to perform a ReDoS and hang a system. The main concern is that this can happen in several scenarios: a hacker attack, an error in the syntax of a regex, an unsanitized string, etc. For this reason there are several things you should do to secure your application.
Checking for evil patterns in your regex
You should pay attention when you use a quantifier (*, +, {1,}, ?) together with another quantifier or an alternation (|). Some examples of bad regexes are:
(a+)+
([a-zA-Z]+)*
(a|aa)+
(a|a?)+
Think about the input your regex will evaluate and give it the strictest rules possible. Prefer a quantifier with a specified maximum quantity (e.g. {1,4}) where possible. Also consider whether replacing the regex with a string method could be a better solution for your case.
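For URL validation specifically, delegating to a real parser can be a better alternative than any regex. A minimal sketch using the built-in URL constructor (available in modern browsers and Node.js; the http/https protocol check is an extra restriction chosen here):

```javascript
// Validate a URL without a regex: let the built-in URL parser do the work.
// Parsing is linear, so there is no backtracking to exploit.
function isValidUrl (input) {
  try {
    var url = new URL(input)
    return url.protocol === 'http:' || url.protocol === 'https:'
  } catch (e) {
    return false // the parser throws on malformed input
  }
}

console.log(isValidUrl('https://google.com/')) // true
console.log(isValidUrl('not a url'))           // false
```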
Sanitizing and filtering user inputs
This rule is also valid for other types of attack. Since the attack can originate from an input string, filter and sanitize it before evaluating it, to avoid unhandled errors or attacks.
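As a sketch (the length limit and the character whitelist below are illustrative assumptions, not a complete URL grammar), you can reject obviously bad inputs with cheap, linear-time checks before running the expensive regex:

```javascript
// Pre-filter input with linear-time checks before the expensive regex.
// The bad URL from the earlier example contains '@', which the URL regex
// never accepts, so rejecting it up front avoids the backtracking entirely.
function sanitizeUrlInput (input) {
  if (typeof input !== 'string' || input.length > 200) {
    return null
  }
  // Whitelist of characters our URL pattern can actually accept
  if (/[^a-zA-Z0-9:\/\._ -]/.test(input)) {
    return null
  }
  return input
}

console.log(sanitizeUrlInput('https://google.com/'))                          // kept as-is
console.log(sanitizeUrlInput('https://google.com/aaaaaaaaaaaaaaaaaaaaaaaa@')) // null
```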
Discovering vulnerabilities with fuzz testing
Some regexes are very complex, and for this reason we may not catch all their vulnerabilities or edge cases at first. As an additional security layer you can use fuzzing to discover these cases before falling victim to them. Just test your regex with random inputs and pay attention when the evaluation time changes drastically: it could be a symptom that your regular expression contains a vulnerability.
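A minimal fuzzing sketch (the alphabet, input lengths, iteration count, and the 50 ms threshold are all arbitrary choices) that flags suspiciously slow evaluations:

```javascript
// Fuzz the regex with random strings and record any input whose
// evaluation time exceeds a threshold: a hint of a ReDoS vulnerability.
var regexUnderTest = /(a+)+c/
var alphabet = 'aaas' // biased toward 'a' to hit long runs more often

function randomString (maxLength) {
  var length = Math.floor(Math.random() * maxLength)
  var s = ''
  for (var i = 0; i < length; i++) {
    s += alphabet[Math.floor(Math.random() * alphabet.length)]
  }
  return s
}

var suspicious = []
for (var i = 0; i < 200; i++) {
  var input = randomString(16)
  var start = Date.now()
  regexUnderTest.test(input)
  var elapsed = Date.now() - start
  if (elapsed > 50) {
    suspicious.push({ input: input, elapsed: elapsed })
  }
}
console.log(suspicious) // any entry here deserves a closer look
```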