Over the last few months, some of our applications crashed periodically. Thanks to the WebError ErrorMiddleware, we receive an email each time an internal server error occurs.
For example, someone tried to retrieve the data for all of our French territories through the API.
End-user tools like web browsers generate valid UTF-8 requests with no effort, but non-UTF-8 requests can be generated by some odd software or by hand from an IPython shell.
Let's dive into the problem in IPython:
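A minimal sketch of such a session, written here as a plain Python 3 script so that it can be run as is. The variable names are ours, chosen for illustration only.

```python
# The 'é' character, written with an escape so the source file encoding does not matter.
char = "\u00e9"

# Its Unicode code point.
codepoint = ord(char)
print(hex(codepoint))

# Its UTF-8 encoding takes two bytes, its Latin-1 encoding only one.
utf8_bytes = char.encode("utf-8")
latin1_bytes = char.encode("latin-1")
print(utf8_bytes, latin1_bytes)

# Decoding the Latin-1 byte as if it were UTF-8 fails.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print("UnicodeDecodeError:", exc)
```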
This shows that U+00E9 is the Unicode code point for the 'é' character (see Wikipedia), that its UTF-8 encoding is the two bytes '\xc3\xa9', and that decoding a Latin-1 byte as UTF-8 raises an error.
The stack trace attached to the error emails helped us find that the UnicodeDecodeError exception is raised when calling one of these Request methods: path_info, script_name and params.
So we wrote a new WSGI middleware to reject mis-encoded requests, returning a 400 Bad Request HTTP error to the client.
The source code of this middleware is published on Gitorious: reject-misencoded-requests
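The idea can be illustrated with a minimal sketch. This is not the published code: the class name `RejectMisencodedRequests` and the helper `_is_valid_utf8` are hypothetical, and the sketch assumes a Python 3 WSGI server following PEP 3333, where `environ` strings carry the raw request bytes as Latin-1 code points. It only checks SCRIPT_NAME, PATH_INFO and the query string, and leaves the POST body untouched.

```python
from urllib.parse import unquote_to_bytes


def _is_valid_utf8(raw: bytes) -> bool:
    """Return True if the given bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


class RejectMisencodedRequests:
    """Hypothetical WSGI middleware answering 400 to requests that are not valid UTF-8."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # Assumption (PEP 3333): str values hold the original bytes as Latin-1
        # code points, so encoding back to Latin-1 recovers the raw bytes.
        raw_values = [
            environ.get("SCRIPT_NAME", "").encode("latin-1"),
            environ.get("PATH_INFO", "").encode("latin-1"),
            # The query string is still percent-encoded, so unquote it to bytes.
            unquote_to_bytes(environ.get("QUERY_STRING", "").encode("latin-1")),
        ]
        if not all(_is_valid_utf8(raw) for raw in raw_values):
            body = b"400 Bad Request: the request is not valid UTF-8.\n"
            start_response(
                "400 Bad Request",
                [
                    ("Content-Type", "text/plain; charset=utf-8"),
                    ("Content-Length", str(len(body))),
                ],
            )
            return [body]
        # The request looks fine: hand it over to the wrapped application.
        return self.app(environ, start_response)
```

Wrapping an application is then a matter of `application = RejectMisencodedRequests(application)`, placed outside the framework so that the check runs before any Request attribute is decoded.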
We could have guessed the encoding and set the Request.encoding attribute, but that would have fixed only the reading of PATH_INFO and SCRIPT_NAME, not the POST and GET parameters, which are expected to be encoded in UTF-8 only.
That's why we simply return a 400 Bad Request HTTP code to our users. This is simpler and does the job.