### What is the influence of C in SVMs with linear kernel?

• I am currently using an SVM with a linear kernel to classify my data. There is no error on the training set. I tried several values for the parameter $C$ ($10^{-5}, \dots, 10^2$). This did not change the error on the test set.

Now I wonder: is this an error caused by the ruby bindings for libsvm I am using (rb-libsvm) or is this theoretically explainable?

Should the parameter $C$ always change the performance of the classifier?

Just a comment, not an answer: *Any* program that minimizes a sum of two terms, such as $\|w\|^2 + C \sum_i \xi_i$, should (imho) tell you what the two terms are at the end, so that you can see how they balance. (For help on computing the two SVM terms yourself, try asking a separate question. Have you looked at a few of the worst-classified points? Could you post a problem similar to yours?)
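The two terms are straightforward to compute by hand from a learned model. A minimal plain-Python sketch (the function name and toy inputs below are mine, for illustration only):

```python
def svm_objective_terms(w, b, X, y, C):
    """Return the two terms of the soft-margin objective: ||w||^2 and C * sum(xi_i).

    w, b -- learned weight vector and bias
    X, y -- training points and labels in {-1, +1}
    C    -- the soft-margin penalty parameter
    """
    reg = sum(wi * wi for wi in w)  # ||w||^2
    # xi_i is the hinge loss of point i: max(0, 1 - y_i * (w . x_i + b))
    slacks = [max(0.0, 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, x)) + b))
              for x, yi in zip(X, y)]
    return reg, C * sum(slacks)
```

Printing both values after training shows immediately which term dominates: if the slack term is zero (or tiny) for every C you try, varying C cannot change the solution much.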

• The C parameter tells the SVM optimization how much you want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points. For very tiny values of C, you should get misclassified examples, often even if your training data is linearly separable.

OK, I understand that C determines the weight of the misclassification penalty in the objective function. The objective function is the sum of a regularization term and the C-weighted sum of the slack variables (see http://en.wikipedia.org/wiki/Support_vector_machine#Soft_margin). When I change C, this does not have any effect on the minimum of my objective function. Could that mean that the regularization term is always very small?

I would suggest trying a wider range of C values, maybe 10^[-5,...,5], or more if the optimization is fast on your dataset, to see if you get something that looks more reasonable. Both the training error and the value of the minimum cost should change as C is varied. Also, is the scale of your data extreme? In general, an optimal C parameter should be larger when you scale down your data, and vice versa, so if you have very small values for features, make sure to include very large values for the possible C values. If none of the above helps, I'd *guess* the problem is in the Ruby bindings.

What I said is partially wrong. Actually, the value of C does have an influence, but it is marginal. I am calculating the balanced accuracy ((tp/(tp+fn)+tn/(tn+fp))/2) on my test set. If C is 10^-5 or 10^-4, the balanced accuracy will be 0.5. When I set C to 10^-3 it is 0.79, for C=10^-2 it is 0.8, for C=10^-1 it is 0.85, and for C=10^0,...,10^7 it is 0.86, which seems to be the best possible value here. The data is normalized such that the standard deviation is 1 and the mean is 0.
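For reference, the balanced accuracy quoted above can be computed straight from the confusion-matrix counts; a minimal plain-Python sketch:

```python
def balanced_accuracy(tp, fn, tn, fp):
    """(tp/(tp+fn) + tn/(tn+fp)) / 2: the mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)  # true-positive rate
    specificity = tn / (tn + fp)  # true-negative rate
    return (sensitivity + specificity) / 2
```

Note that a degenerate classifier which predicts only one class scores exactly 0.5 on this measure, which matches the values reported for C = 10^-5 and 10^-4.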

Changing the balanced accuracy from 0.5 (just guessing) to 0.86 doesn't sound like a marginal influence to me. It would be a good idea to investigate a finer grid of values for C as Marc suggests, but the results you have given seem to be fairly normal behaviour. One might expect the error to go back up again as C tends to infinity due to over-fitting, but that doesn't seem to be much of a problem in this case. Note that if you are really interested in balanced error and your training set doesn't have a 50:50 split, then you may be able to get better results...

... by using different values of C for patterns belonging to the positive and negative classes (which is asymptotically equivalent to resampling the data to change the proportion of patterns belonging to each class).
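Concretely, using different C values per class amounts to weighting the slack penalty by class (LIBSVM exposes this as per-class weights that multiply C). A plain-Python sketch of the modified objective, with made-up names and toy data:

```python
def class_weighted_objective(w, b, X, y, C_pos, C_neg):
    """||w||^2 plus slack penalties, with a separate C for each class."""
    reg = sum(wi * wi for wi in w)
    penalty = 0.0
    for x, yi in zip(X, y):
        slack = max(0.0, 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, x)) + b))
        penalty += (C_pos if yi > 0 else C_neg) * slack  # class-dependent weight
    return reg + penalty
```

Raising C_pos relative to C_neg makes positive-class mistakes more expensive, much like oversampling the positive class would.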

Indeed, this is not marginal. :D What confuses me is that C=10^0,...,10^7 (and, I think, 10^i for i > 7 as well) produce exactly the same confusion matrix (which should actually be the best possible result). High complexity means the influence of the regularization term is insignificant. Does that mean the regularization term actually worsens the result?

I think it is possible that once you get to C=10^0 the SVM is already classifying all of the training data correctly, and none of the support vectors are bound (i.e. have alpha equal to C); in that case, making C bigger has no effect on the solution.
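This can be checked directly on the objective: if some w classifies every training point with functional margin at least 1, all slacks are zero, the C-weighted term vanishes, and the objective value no longer depends on C. A toy check in plain Python (weights and data are made up for illustration):

```python
def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of hinge losses."""
    reg = 0.5 * sum(wi * wi for wi in w)
    hinge = sum(max(0.0, 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, x)) + b))
                for x, yi in zip(X, y))
    return reg + C * hinge

# Separable toy data, classified with functional margin >= 1 by w = [1, 0]:
X = [[2.0, 0.3], [3.0, -1.0], [-1.5, 0.2], [-2.0, 1.0]]
y = [1, 1, -1, -1]
w, b = [1.0, 0.0], 0.0

# Every slack is zero here, so the objective is 0.5 * ||w||^2 = 0.5 for any C.
```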

@MarcShivers "In general, an optimal C parameter should be larger when you scale down your data, and vice versa, so if you have very small values for features, make sure to include very large values for the possible C values" How does one prove that in the SVM formulation?
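A quick way to see it (my sketch, not from the thread): rescaling the inputs x -> s*x can be absorbed by w -> w/s, which leaves every margin y*(w.x + b) unchanged but shrinks ||w||^2 by a factor of s^2; matching the two objectives term by term then requires replacing C by C/s^2. So scaling the data down (s < 1) calls for a proportionally larger C. A numeric check in plain Python:

```python
def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of hinge losses."""
    reg = 0.5 * sum(wi * wi for wi in w)
    hinge = sum(max(0.0, 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, x)) + b))
                for x, yi in zip(X, y))
    return reg + C * hinge

X = [[1.0, 2.0], [-0.5, 1.5], [2.0, -1.0]]
y = [1, -1, 1]
w, b, C, s = [0.8, -0.3], 0.1, 4.0, 0.1  # s < 1 means "scaling down" the data

X_scaled = [[s * xj for xj in x] for x in X]
w_scaled = [wj / s for wj in w]

# J(w/s, b; s*X, C/s^2) equals J(w, b; X, C) / s^2, so the minimizers correspond:
lhs = soft_margin_objective(w_scaled, b, X_scaled, y, C / s ** 2)
rhs = soft_margin_objective(w, b, X, y, C) / s ** 2
```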

• The value of C doesn't affect the linear kernel. It only affects other kernels, in the way @MarcShivers wrote.