I am using Python to aggregate HTML data, parse it, clean it, and pass it to another server. If you have ever done this, you know how Unicode can come back to bite you.

Firstly, go read http://farmdev.com/talks/unicode/ to get an understanding of what ASCII / Unicode errors are about.

Secondly, here are a few tips:

  1. Pass your HTML data through lxml. Let third-party libraries do the heavy lifting of detecting and decoding character encodings.
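  2. If you are running a cron job, note that a print may crash: when STDOUT is not attached to a terminal, Python may have no encoding associated with it and can fall back to ASCII, so printing any non-ASCII character raises a UnicodeEncodeError. Explicitly encode your strings when printing them under cron.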
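
To illustrate the second tip, here is a minimal sketch of a print helper that encodes explicitly instead of trusting whatever encoding STDOUT happens to have. The name `safe_print` and its parameters are my own invention for this example, and UTF-8 is an assumption; use whatever encoding the consumer of your output expects.

```python
import io
import sys


def safe_print(text, stream=None, encoding="utf-8"):
    """Write text as explicitly encoded bytes, ignoring the stream's own encoding.

    safe_print is a hypothetical helper, not part of any library.
    """
    if stream is None:
        stream = sys.stdout
    # Encode up front; "replace" swaps unencodable characters for "?"
    # instead of raising UnicodeEncodeError.
    data = (text + "\n").encode(encoding, "replace")
    # Text streams in Python 3 expose their underlying byte stream as .buffer.
    # Streams without it (for example io.BytesIO) are written to directly.
    target = getattr(stream, "buffer", stream)
    target.write(data)
    target.flush()


if __name__ == "__main__":
    safe_print(u"caf\u00e9 na\u00efve")
```

Another option, if you would rather not touch the code, is to set the PYTHONIOENCODING environment variable (for example to utf-8) in the crontab entry so that Python uses that encoding for STDOUT.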