In what data type should I store an email address in database?
I understand that an 254 character email address is valid, but implementations I have researched tend to use a varchar(60) to varchar(80) or equivalent. For example: this SQL Server recommendation uses varchar(80) or this Oracle example
Is there a reason to not use the full 254 character maximum? Doesn't a varchar by definition use only as much storage as needed to hold the data?
Are there significant performance implications/trade-offs which cause so many implementations to use less than the full 254 possible characters?
I've always used
VARCHAR(320). Here's why. The standard dictates the following limitations:
- 64 characters for the "local part" (username).
- 1 character for the
- 255 characters for the domain name.
Now, some folks will say you need to support more than that. Some folks will also say that you need to support Unicode for domain names (meaning you have to switch to
NVARCHAR). While the standard may change in the meantime (it's been a while since I've had skin in the game), I am quite confident that at this time most servers in the world will not accept Unicode e-mail addresses, and I am sure many servers will have issues creating and/or accepting addresses with > 320 characters.
That said, you can prepare for the worst now, if you like (and if you are using Data Compression in SQL Server 2008 R2 or better, you will benefit from Unicode compression, meaning you only pay the 2 byte penalty for characters that actually need it). This way you can make your column as wide as you want, and you can let people stuff any too-long junk in there that they want - they won't receive an e-mail if they give you junk just like they won't receive an e-mail if the insert fails. The problem is if you let invalid junk in, you have to deal with it. And no matter what size you make it - if someone will try to stuff 400 characters into a 320-character column, someone will try to stuff 1025 characters into a 1024-character column. There is no reason any sensible person should have an e-mail address > 320 characters unless they are using it to explicitly test system boundaries.
But stop asking for opinions on this - and stop looking at other implementations for guidance (it just so happens in this case that the ones you referenced did not bother to do their own homework and just picked numbers out of their, well, you know). You have direct access to the standard - make sure you consult the most current version, support that as a minimum, and stay on top of the standard so you can adapt to changes in specs.
EDIT thanks to @ypercube for the ping in chat.
As an aside, perhaps you don't want to dump the whole address into a single column in the first place. Normalization might suggest that you don't want to store
@hotmail.com15 million times when a much skinnier FK int would work just fine and not have the additional overhead of variable length columns. You could also normalize the username, as
[email protected]share a common username - they don't know each other but your database doesn't care about that.
I talked about some of this here:
This introduces challenges however to the 254-character limit above, since there doesn't seem to be consensus about what happens when a valid 255-character domain is combined with a valid 1-character localpart. This should be accepted by most servers around the world but seem to violate this 254-character limit. So do you create a
Domainstable that has an artificially lower restriction on length for e-mail addresses, when the domain could be re-used as a valid 255-character URL?
I like this approach but what about the email uniqueness? How is it managed?
There are a few considerations with this decision. First and foremost is to use current and future predictions of necessary limitations that the data will have to conform to. There's a reason why you don't want to set every string column data type to
varchar(1024)when you are just storing a string that shouldn't exceed 32 characters (emphasis on the should keyword).
If you have some sort of vulnerability where emails are all modified to become 255 characters, then you could potentially have a long performance impact of page splits. This may seem out of the ordinary, and it most likely is, but you need to size your data to the business requirement. Much like the age-old constraint at the database vs. application debate, I'm a firm believer that data type limitations and allowable values should also be enforced at the data tier.
Which leads me to my next point. The database is most likely just the data tier. What does the application tier utilize? For instance, if you have an application where you can only enter 80 characters for an email address, why would you want to the data type to be any larger? Business needs to answer two questions:
- What can it be?
- What should it be?
Only then you'll have your answer.
Doesn't a varchar by definition use only as much storage as needed to hold the data?
Yes and no. There is going to be a sort of offset for the variable length data to record the length of it.
RFC 5321 (the current SMTP spec, obsoletes RFC2821) states:
The maximum total length of a user name or other local-part is 64 octets. The maximum total length of a domain name or number is 255 octets
So 64 + 255 + @ sign implies VARCHAR(320). You probably will never need this much but it's safe to have it, just in case.
The correct limit is 254. http://www.rfc-editor.org/errata_search.php?rfc=3696&;eid=1690
As a comment to the excellent answers already here:
First, if you have created the field as
varchar(240)and you want to later change it to a longer field, say
varchar(320), this change should be a trivial operation on the database server - depending, of course, on your database product.
alter table Schema.Object alter column EmailAddress varchar(320) ;
Second, depending on the average row size and the page size, using
varchar(240)might not change the number of allocated pages (the disk space actually taken up by the table).
Third, someone above talked about validating an email address. I contend that there is only one sure way to validate an email address and that is to send an email to it. :-)
Any variation of VARCHAR uses only as much space in the data block as needed. The additional bytes for storing the length are trivial compared to the space that would be wasted using a fixed-length CHAR instead.
Since a VARCHAR column length is really a "maximum length," it should be set larger than the maximum length possible under any circumstances. Only as much space as each row needs will be used. The application programs should then be designed with scrolling fields or whatever makes sense based on typical values.
A database design is like a physical piece of paper in that it sets out the hard limits as to size. A paper page cannot be enlarged. In this analogy, the application program is like a form printed on the page. There is a lot that can be done to adjust how much data we can hold in the form.
Though the command to increase a VARCHAR size may look simple and run instantly on a small table, doing so on a table with thousands of rows or more is probably going to require some kind of database quiesce while regenerating all the data and index blocks. One way is to copy everything to a new table with the larger columns. Whatever technique is used, it's a-big-hairy-deal. Thus, you should consider the VARCHAR column size largely immutable once a production table is loaded.
VARCHAR is the best data type to be used for email addresses as Emails vary a lot by length. NVARCHAR is also an alternative but I would recommend it to use only if the email address contains extended chars and keep in mind that it requires double amount of storage space as compared to VARCHAR.
In my environment, we use varchar(70) as the longest ones that I have came across are closely 60-70 char long, but it depends on your company's customer base as well. Also, as a side-note, make sure that you have some Email validation check in-place for the validity of Email addresses.. like using check constraints or CHARINDEX
If you're using an Enterprise Database server, there should be someway to store an email address as a
DOMAINwith some level of validity. Domains are specified in the SQL specification
A domain is a named user-defined object that can be specified as an alternative to a data type in certain places where a data type can be specified. A domain consists of a data type, possibly a default option, and zero or more (domain) constraints.
For instance, the free and open source PostgreSQL supports this, barring any limitations in your implementation of the spec, the column itself contains a valid email. You can for instance..
- Create a custom
DOMAINover the HTML5 spec of email.
- Or, over the RFC822, RFC2822, RFC5322 spec of email.
- Create a custom
DOMAINthat checks the server for an MX-record at the time of check.
I evaluate these options in this answer which is specific to PostgreSQL
- Create a custom