How to solve UTF8 invalid byte sequence copy errors on a restore, when the source database is encoded in UTF8?

  • I was given the task to migrate a PostgreSQL 8.2.x database to another server. To do this I'm using the pgAdmin 1.12.2 (on Ubuntu 11.04 by the way) and using the Backup and Restore using the custom/compress format (.backup) and UTF8 encoding.

    The original database is in UTF8, like so:

    -- Database: favela
    
    -- DROP DATABASE favela;
    
    CREATE DATABASE favela
      WITH OWNER = favela
           ENCODING = 'UTF8'
           TABLESPACE = favela
           CONNECTION LIMIT = -1;
    

    I'm creating this database exactly like this on the destination server. But when I restore the database from the .backup file using the Restore option it gives me some of these errors:

    pg_restore: restoring data for table "arena"
    pg_restore: [archiver (db)] Error while PROCESSING TOC:
    pg_restore: [archiver (db)] Error from TOC entry 2173; 0 35500 TABLE DATA arena favela
    pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xe3a709
    HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
    CONTEXT:  COPY arena, line 62
    

    When I check which record triggered this error in fact some vartext fields have diacritical characters like ç (used in Portuguese, for example, "caça"), and when I manually remove them from the text in the records the error passes to the next record that has them - since when copy has an error it stops inserting data on this table. And I don't want to replace them manually one by one to accomplish this.

    But it's kinda of strange because with UTF8 there shouldn't be this kind of problems, right?

    I don't know how they got there in the first place. I'm only migrating the database, and I supose that somehow the database was like in LATIN1 and then was improperly changed to UTF8.

    Is there any way to check if a table/database has invalid UTF8 sequences? Or any way to enforce/reconvert these characters into UFT8 so I don't run into any problems when I execute the restore?

    Thanks, in advance.

  • Richard

    Richard Correct answer

    9 years ago

    Digging around the internet, I've seen that this is a pretty common problem. The common solution is to use the plain text format dump and feed it through iconv to correct the encoding.

    Here is more information about that.

    use iconv to convert to UTF-32 discarding invalid symbols and then back to UTF-8, a UTF-8 to UTF-8 conversion won't catch all bad code points. (eg orphan surrogates)

  • "I don't know how they got there in the first place"

    It could have happened as described here - although this generates an error on 8.4:

    If you create a table with any text type (i.e. text, varchar(10), etc.), then you can insert an invalid byte sequence into that field using octal escapes.

    For instance, if you have a UTF8-encoded database, you can do:

    => CREATE TABLE foo(t TEXT);

    => INSERT INTO foo VALUES(E'\377');

    Now, if you COPY the table out, you can't COPY the resulting file back in. That means your pg_dump backups won't be able to restore. The only way to get your data back in is to re-escape that value.

    There is a good post on this excellent blog about the general issues and some ways to deal with them

  • Its likely with the default encoding used in your Unix/Linux environment. To check which encoding is currently the default one, execute the following:

    $ echo $LANG
    en_US
    

    In this case, we can clearly see it is not an UTF-8 encoding, the one which the copy command relies on.

    So to fix this, we just set the LANG variable in example to the following:

    $ export LANG=en_US.UTF-8
    

    Note: This will only be available for the current session. Add it to ~/.bashrc or similar to have it available on startup of any future shell session.

    Reference

  • I don't recommend blindly running iconv on the plain text dump because it may convert valid characters(eg: Chinese characters ) to some other characters. It is better to find the invalid UTF8 character by running below command.

    grep -naxv '.*' plain_text_dump.sql
    

    and then run iconv on the particular data. Check this doc for detailed step by step explanation.

  • I referenced the following link which gave me clues to determine the source encoding and then convert it into the desired UTF-8 encoding. Linux Check and Change Encoding

    $ file -bi cabot.sql
    text/plain; charset=utf-16le
    $ iconv -f utf-16le -t utf-8 -o converted.sql cabot.sql
    $ file -bi converted.sql
    text/plain; charset=utf-8
    

License under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM