Logistic Regression: A Comprehensive Guide
Logistic regression is the bedrock of binary classification in statistics and machine learning. This guide covers everything — from the sigmoid function and odds ratios to multinomial models, Python implementation, and model evaluation — written clearly for students and working professionals who need to understand it, use it, and explain it.
Definition & Overview
What Is Logistic Regression?
Logistic regression is one of the most widely used statistical and machine learning techniques for classification tasks. Despite the word "regression" in its name, it does not predict a continuous numerical outcome. It predicts the probability that an observation belongs to a particular category. That makes it fundamentally different from linear regression, even though both methods share a linear model structure at their core.
Think of it this way. You want to know whether a student will pass or fail an exam based on hours studied. Linear regression would produce an unbounded output — a predicted score that could technically be 1.4 or negative. Logistic regression instead asks: what is the probability this student passes? And it keeps that probability between 0 and 1, where it belongs. This is the foundational insight behind the model. Regression analysis more broadly covers a range of predictive techniques, but logistic regression occupies a special place because it bridges statistical inference and modern machine learning seamlessly.
Logistic regression is used across virtually every data-intensive field. In medicine, it predicts whether a patient has a disease. In finance, it flags whether a loan applicant will default. In technology, it powers spam filters. In social science research, it models binary survey outcomes. Its interpretability, computational efficiency, and well-understood mathematical properties make it a first-choice model in many professional and academic settings.
0–1: The output range of a logistic regression model. The output is always a probability, never an unbounded number.
1958: The year David Cox formally introduced logistic regression analysis, in the Journal of the Royal Statistical Society, Series B.
3: The main types (binary, multinomial, and ordinal), each suited to a different structure of the outcome variable.
Why Is It Called "Regression" if It Classifies?
This trips up a lot of students. Logistic regression is technically a regression model because it estimates a continuous quantity — a probability. The classification step happens afterward: if the estimated probability exceeds a threshold (typically 0.5), the observation is assigned to one class; otherwise, to the other. The model itself is a regression of log-odds on the predictor variables. So the name is historically accurate even if it initially sounds misleading.
The core idea: Logistic regression takes a linear combination of predictor variables and transforms it through the sigmoid function to produce a probability. That probability is then used to make a classification decision. The "regression" is on the log-odds of the outcome; the "logistic" is the S-shaped curve that squashes the output into the open interval (0, 1).
A Brief History
The logistic function itself was first described by Pierre François Verhulst, a Belgian mathematician, in 1838; he used it to model population growth. In statistics, Joseph Berkson at the Mayo Clinic introduced the logit transformation in the 1940s. The formal logistic regression framework as we use it today was largely developed by the British statistician David Cox, who published the foundational paper in 1958 in the Journal of the Royal Statistical Society, Series B. Since then, logistic regression has become a staple in epidemiology, biostatistics, econometrics, and machine learning. Understanding the difference between descriptive and inferential statistics gives essential context for where logistic regression sits within the statistical toolkit.
Mathematical Foundation
The Sigmoid Function and the Math Behind Logistic Regression
You cannot fully understand logistic regression without understanding the sigmoid function. It is the mathematical mechanism that converts any real number — positive, negative, enormous, tiny — into a probability between 0 and 1. That is the crucial transformation that makes logistic regression work.
What Is the Sigmoid Function?
The sigmoid function has this formula:
σ(z) = 1 / (1 + e^(−z))
Where z is the linear predictor: z = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
When z is very large and positive, e^(−z) approaches zero and σ(z) approaches 1. When z is very large and negative, e^(−z) becomes enormous and σ(z) approaches 0. At z = 0, σ(z) = 0.5. This S-shaped curve, the logistic curve, is exactly why the method is called logistic regression. The sigmoid function is also called the logistic function, and the two terms are interchangeable in this context.
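The limiting behaviour described above is easy to check numerically. The following is a minimal NumPy sketch; the function name `sigmoid` and the sample inputs are illustrative choices, not taken from any particular library.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative inputs: a large negative value, zero, and a large positive value
z_values = np.array([-10.0, 0.0, 10.0])
probs = sigmoid(z_values)
print(probs)  # first value is close to 0, middle value is 0.5, last is close to 1
```

For inputs with very large magnitude, this direct formula can trigger overflow warnings in `np.exp`; `scipy.special.expit` is a numerically stable alternative if SciPy is available.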
For the model to work, we need that linear predictor z. In logistic regression, z is exactly what you would compute in linear regression: a weighted sum of the input features plus an intercept. The difference is that instead of using z as the final output, we pass it through the sigmoid. The model is learning the weights (β coefficients) that produce the best probabilities for the known outcomes in the training data. To see how this connects with other predictive frameworks, the guide on simple linear regression is a useful starting point for comparison.
What Are Log-Odds and the Logit Transformation?
Here is where the math gets elegant. If p is the probability of the positive outcome, then the odds of that outcome are p / (1 − p). The log-odds (also called the logit) are simply the natural logarithm of the odds:
logit(p) = ln(p / (1 − p)) = β₀ + β₁x₁ + ... + βₙxₙ
The log-odds are a linear function of the predictor variables — this is why logistic regression is a linear model in the log-odds space
The logit transformation is why logistic regression belongs to the family of generalized linear models (GLMs). The outcome being modeled is not the raw probability but the log-odds of the probability. This transformation makes the relationship linear, which is computationally tractable. When you exponentiate the regression coefficients (e^β), you get odds ratios, the quantities that are reported and interpreted in research papers and clinical studies.
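To make the relationship between probability, odds, and log-odds concrete, here is a small sketch using only the standard library. The helper names `logit` and `inverse_logit` and the example probability of 0.8 are assumptions chosen for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p, where 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(z):
    """Sigmoid: converts log-odds back into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

p = 0.8                               # illustrative probability
odds = p / (1.0 - p)                  # about 4: the outcome is four times as likely to occur as not
log_odds = logit(p)                   # natural log of the odds
recovered = inverse_logit(log_odds)   # applying the sigmoid undoes the logit

print(odds, log_odds, recovered)
```

The round trip shows that the logit and the sigmoid are inverses of each other, which is why a model that is linear in log-odds produces sigmoid-shaped probabilities.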
What Is Maximum Likelihood Estimation (MLE)?
Linear regression uses least squares to find the best-fitting line — minimizing the sum of squared residuals. Logistic regression cannot use least squares because the outcome is categorical, not continuous. Instead, it uses maximum likelihood estimation (MLE).
MLE finds the set of coefficients (β values) that maximize the likelihood function — the probability of observing the actual training data given the model's current parameter values. In plain terms: it finds the coefficients that make the observed outcomes most probable under the model. This is done iteratively using numerical optimization algorithms — most commonly gradient descent or Newton-Raphson methods — since there is no closed-form solution for logistic regression coefficients the way there is for linear regression. The guide on regression model assumptions covers the theoretical scaffolding that underpins both linear and logistic frameworks.
Log-Loss (Binary Cross-Entropy): The cost function that logistic regression minimizes during training is called log-loss or binary cross-entropy. For each training observation, it penalizes the model heavily when a confident prediction is wrong, and lightly when a confident prediction is correct. Minimizing log-loss is mathematically equivalent to maximizing the likelihood of the observed data.
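The link between log-loss and iterative fitting can be sketched in a few lines of NumPy. This is a bare-bones gradient descent on invented toy data (hours studied versus pass/fail), intended only to show the mechanics; production solvers such as those in scikit-learn or statsmodels use more sophisticated optimizers. The data, learning rate, and iteration count are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p, eps=1e-12):
    """Binary cross-entropy, averaged over observations."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical toy data: hours studied and pass (1) / fail (0)
hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

X = np.column_stack([np.ones_like(hours), hours])  # intercept column + predictor
beta = np.zeros(2)                                 # start from all-zero coefficients
initial_loss = log_loss(passed, sigmoid(X @ beta))

learning_rate = 0.1  # assumed step size
for _ in range(5000):
    p = sigmoid(X @ beta)
    gradient = X.T @ (p - passed) / len(passed)    # gradient of the mean log-loss
    beta -= learning_rate * gradient

final_loss = log_loss(passed, sigmoid(X @ beta))
print(beta, initial_loss, final_loss)
```

With all-zero coefficients every prediction is 0.5, so the starting loss equals ln 2. Each update moves the coefficients in the direction that lowers the log-loss, which is the same as raising the likelihood of the observed outcomes.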
The Decision Boundary
After training, the model uses a decision boundary to classify new observations. The default threshold is 0.5: if the predicted probability is above 0.5, the model assigns the positive class (1); otherwise, the negative class (0). This boundary is not fixed — you can raise or lower it depending on the cost asymmetry of false positives and false negatives in your specific application. In medical diagnosis, for example, a lower threshold may be appropriate to capture more true positives at the cost of more false positives.
In two-dimensional feature space, the decision boundary of logistic regression is a straight line. In higher dimensions, it is a hyperplane. This linearity is both the model's strength (interpretable, fast) and its limitation (it cannot capture complex, non-linear decision boundaries without feature engineering).
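The effect of moving the classification threshold can be illustrated with a handful of made-up predicted probabilities. The values, the 0.3 screening threshold, and the helper names below are assumptions for illustration only.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels
probs = np.array([0.10, 0.32, 0.45, 0.55, 0.80])
y_true = np.array([0, 0, 1, 1, 1])

def classify(probs, threshold=0.5):
    """Assign class 1 when the predicted probability is above the threshold."""
    return (probs > threshold).astype(int)

def recall_and_false_positives(y_true, y_pred):
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    return tp / (tp + fn), fp

default_pred = classify(probs)          # threshold 0.5
screening_pred = classify(probs, 0.3)   # lower threshold, as in a screening setting

print(recall_and_false_positives(y_true, default_pred))
print(recall_and_false_positives(y_true, screening_pred))
```

In this toy case, lowering the threshold catches the positive case that the default threshold missed, at the price of one additional false positive.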
Types & Variants
Types of Logistic Regression: Binary, Multinomial, and Ordinal
Logistic regression is not a single model — it is a family of models whose specific form depends on the structure of the outcome variable. Choosing the wrong type is a common error in statistics assignments and research papers. Here is exactly how to distinguish them.
Binary logistic regression: Two outcome classes, such as pass/fail, yes/no, or disease/no disease. The most common type. Uses the sigmoid function and a single set of coefficients.
Multinomial logistic regression: Three or more unordered outcome classes, e.g., political party affiliation (Democrat, Republican, Independent). Uses the softmax function.
Ordinal logistic regression: Three or more ordered outcome classes, e.g., satisfaction level (low, medium, high). Accounts for natural ranking using cumulative logit models.
Binary Logistic Regression
Binary logistic regression is the standard form. The dependent variable takes exactly two values — conventionally coded 0 and 1. The model estimates the probability that an observation belongs to class 1 given a set of predictor variables. This is the form you will encounter most often in introductory statistics courses, research methods classes, and data science programs. Applications include spam detection, disease prediction, loan default modeling, and binary survey responses.
One thing worth appreciating: binary logistic regression is remarkably robust in practice. It handles mixed predictor types (continuous and categorical), works with moderate sample sizes, and produces interpretable coefficients. That combination explains its extraordinary longevity as a standard tool in both research and industry.
Multinomial Logistic Regression
Multinomial logistic regression extends binary logistic regression to situations where the dependent variable has three or more unordered categories. It does not assume any ranking among the categories. Instead of a sigmoid function, it uses the softmax function to produce a probability for each class simultaneously — and these probabilities always sum to 1.
In practice, the model picks one category as the reference (baseline) and estimates the log-odds of each other category relative to that reference. If you are predicting which type of transportation a commuter will choose (car, bus, bicycle, or train), multinomial logistic regression is the right tool. See the full guide on the multinomial distribution for the probabilistic foundation that underlies this model.
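As a rough illustration of how softmax turns one linear score per class into probabilities, consider the commuter example. The scores below are invented; the reference category (car) is given a score of 0, mirroring the baseline-category setup described above.

```python
import numpy as np

def softmax(scores):
    """Convert one linear score per class into probabilities that sum to 1."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()   # shifting by the max improves numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical linear scores for one commuter: car (reference), bus, bicycle, train
classes = ["car", "bus", "bicycle", "train"]
scores = np.array([0.0, -0.5, -1.2, 0.3])
probs = softmax(scores)

for name, p in zip(classes, probs):
    print(f"{name}: {p:.3f}")
```

With only two classes and the reference score fixed at 0, softmax reduces to the sigmoid, which is why binary logistic regression is a special case of the multinomial model.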
Ordinal Logistic Regression
Ordinal logistic regression handles outcomes with three or more categories that have a natural order — like education level (high school, bachelor's, master's, doctorate) or Likert scale responses (strongly disagree to strongly agree). The most common form is the proportional odds model, which estimates the log-odds of being at or below each category threshold, assuming these odds ratios are constant across threshold levels. Violating the proportional odds assumption is a common modeling error — it should always be tested before reporting results.
Which Type Should You Use?
The decision rule is simple. Two outcome categories: binary logistic regression. Three or more unordered categories: multinomial logistic regression. Three or more ordered categories: ordinal logistic regression. If you mistakenly treat an ordinal outcome as nominal (using multinomial instead of ordinal regression), you lose statistical power and discard meaningful information embedded in the ranking.
Simple vs. Multiple Logistic Regression
A second axis of classification cuts across all three types above: the number of predictor variables. Simple logistic regression uses a single predictor. Multiple logistic regression (also called multivariable logistic regression) uses two or more predictors simultaneously. The multiple version is far more common in real research because most outcomes are influenced by more than one factor. Multiple logistic regression estimates the association of each predictor with the outcome while holding the others constant. In observational studies, this adjustment for measured confounders is necessary, though not by itself sufficient, for drawing causal conclusions.
Multiple logistic regression requires attention to multicollinearity — when two or more predictors are highly correlated with each other. Severe multicollinearity destabilizes coefficient estimates and inflates standard errors. The variance inflation factor (VIF) is the standard diagnostic. For related regularization techniques that help manage these issues, the Ridge and Lasso regression guide covers the penalized regression methods that can stabilize logistic models in high-dimensional settings.
Assumptions & Requirements
Assumptions of Logistic Regression You Must Know
Logistic regression is a powerful model, but it is not assumption-free. Running a logistic regression without checking its assumptions is one of the most common errors in student assignments and published research alike. Violating these assumptions does not always destroy the analysis, but it can bias your coefficient estimates, inflate your standard errors, or produce meaningless significance tests.
1. Binary or Categorical Dependent Variable
Logistic regression requires the outcome variable to be categorical. Binary logistic regression needs exactly two categories. This sounds obvious, but students sometimes force a continuous outcome into logistic regression by dichotomizing it — for example, splitting a test score at 50 to create pass/fail. This is almost always a bad idea. It discards information, reduces statistical power, and introduces an arbitrary threshold. If your outcome is truly continuous, linear regression or another continuous-outcome model is the appropriate choice.
2. Independence of Observations
Each observation in your dataset must be independent of the others. This means the outcome for one observation should not influence or be influenced by the outcome of another. This assumption is violated in clustered data (students within schools), repeated measures data (the same patient measured multiple times), or matched case-control data. When independence is violated, standard logistic regression standard errors will be incorrect — typically underestimated — producing falsely narrow confidence intervals and inflated statistical significance. Mixed-effects logistic regression or conditional logistic regression handles these situations.
3. Linearity of Continuous Predictors and Log-Odds
This is the assumption that catches many students off guard. Logistic regression does not require a linear relationship between predictors and the probability of the outcome. But it does require a linear relationship between each continuous predictor and the log-odds of the outcome. This distinction matters. A predictor that has a non-linear effect on the log-odds will produce biased estimates if entered as a raw linear term. The standard test for this is the Box-Tidwell transformation or inspection of smoothed residual plots.
4. No Severe Multicollinearity
When two or more predictor variables are highly correlated, the model cannot reliably separate their individual effects on the outcome. Coefficients become unstable and standard errors balloon. The standard diagnostic is the variance inflation factor (VIF). A VIF above 10 (some use 5 as the threshold) suggests problematic multicollinearity. Solutions include removing one of the correlated predictors, combining them into a composite score, or using regularized regression. The guide on regression model assumptions covers all of these diagnostics in detail.
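The VIF for a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the others. The sketch below computes it with plain NumPy on synthetic data in which one column is nearly a copy of another; the function name `vif` and the data are illustrative assumptions. (statsmodels also ships a `variance_inflation_factor` helper for the same purpose.)

```python
import numpy as np

def vif(X):
    """VIF per column of X (n_samples x n_features): 1 / (1 - R^2),
    with R^2 from regressing that column on the remaining columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    values = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])   # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        values.append(1.0 / (1.0 - r_squared))
    return np.array(values)

# Synthetic predictors: x2 is almost a copy of x1, x3 is unrelated
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)  # x1 and x2 should be far above 10; x3 should be near 1
```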
5. Absence of Influential Outliers
Extreme outliers in continuous predictors can disproportionately influence logistic regression coefficients. Unlike linear regression, logistic regression does not have a simple residual plot — instead, you use Cook's distance, leverage values, and DFBETA statistics to identify influential cases. Cases with large Cook's distance or leverage should be examined individually. If they represent data entry errors, they should be corrected. If they are genuine extreme values, sensitivity analyses with and without those cases are appropriate.
6. Sufficiently Large Sample Size
Logistic regression is a large-sample method. The classic rule of thumb, often attributed to Peduzzi et al. (1996) in the Journal of Clinical Epidemiology, is that you need at least 10 events per predictor variable (EPV). Events are the observations in the less common outcome category. So if you have 5 predictors, you need at least 50 events; a dataset of 200 observations in which the positive outcome occurs 30% of the time has 60 events, or 12 EPV, which is only just above the minimum. Fewer events per predictor leads to overfitting, unstable coefficients, and unreliable confidence intervals. Hypothesis testing in logistic regression depends on asymptotic normal theory, which performs poorly in small samples. The hypothesis testing guide covers the theoretical basis for significance testing that applies here.
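The events-per-variable arithmetic can be checked with a short helper. The function name is hypothetical, and the inputs reuse the 200-observation, 30%-event-rate, 5-predictor scenario from the paragraph above.

```python
def events_per_variable(n_obs, event_rate, n_predictors):
    """EPV: count of the less common outcome divided by the number of predictors."""
    minority_rate = min(event_rate, 1.0 - event_rate)
    n_events = n_obs * minority_rate
    return n_events / n_predictors

epv = events_per_variable(n_obs=200, event_rate=0.30, n_predictors=5)
print(f"EPV = {epv:.1f}")  # 60 events across 5 predictors

# Events needed to reach the 10-EPV rule of thumb with 5 predictors
min_events_needed = 10 * 5
print(min_events_needed)
```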
7. No Perfect Separation
Perfect separation (also called complete separation) occurs when one or more predictors perfectly predict the outcome — every observation with a high value on predictor X falls in class 1, and every observation with a low value falls in class 0. When this happens, maximum likelihood estimation fails to converge: the algorithm keeps pushing the coefficient toward infinity because a larger coefficient always increases the likelihood. Perfect separation is more common than people expect, particularly with small datasets or dummy-coded variables. Penalized likelihood methods (Firth logistic regression) or Bayesian priors are the standard solutions.
⚠️ Common assignment error: Students often list the assumptions of linear regression (normally distributed errors, homoscedasticity) when asked about logistic regression assumptions. These do not apply. In logistic regression the outcome follows a Bernoulli (binomial) distribution, so the errors are not normally distributed. Homoscedasticity is not an assumption. Normality of residuals is not required. Focus on the correct assumptions listed above.
Coefficients & Interpretation
How to Interpret Logistic Regression Coefficients and Odds Ratios
Interpreting the output of a logistic regression model is where students most commonly lose marks and where researchers most commonly confuse their audiences. The coefficients produced by logistic regression are not directly interpretable as probabilities. They are on the log-odds scale. To communicate results meaningfully, you almost always convert them to odds ratios.
What Does the Coefficient (β) Mean?
Each coefficient β in a logistic regression model represents the change in the log-odds of the outcome for a one-unit increase in the corresponding predictor, holding all other predictors constant. Log-odds are not intuitive to most readers — they range from negative infinity to positive infinity and require logarithmic thinking to interpret. A β of 0.4 tells you the log-odds increase by 0.4 per unit increase in the predictor. That is mathematically precise but practically hard to communicate.
What Is an Odds Ratio and How Do You Calculate It?
The odds ratio (OR) is simply the exponentiated coefficient: OR = e^β. This is the standard unit of interpretation in logistic regression. An odds ratio answers the question: how do the odds of the outcome change for a one-unit increase in the predictor?
Odds Ratio = e^β
OR > 1: Predictor increases the odds of the outcome. OR < 1: Predictor decreases the odds. OR = 1: No association.
For example: if the OR for "hours studied" predicting exam pass is 1.25, it means that each additional hour of study is associated with 25% higher odds of passing the exam, holding other variables constant. If the OR for "being female" predicting disease is 0.68, it means females have 32% lower odds of the disease than males (since 1 − 0.68 = 0.32), all else equal.
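The conversion from coefficient to odds ratio to percentage change is a one-liner each way. The sketch below starts from the coefficient that corresponds to the OR of 1.25 in the hours-studied example; the three-hour extension is an added illustration of how odds ratios compound multiplicatively.

```python
import math

beta_hours = math.log(1.25)               # log-odds coefficient corresponding to OR = 1.25
odds_ratio = math.exp(beta_hours)         # exponentiate to get back to the odds ratio
percent_change = (odds_ratio - 1) * 100   # percentage change in odds per extra hour

# Odds ratios multiply: three extra hours scale the odds by OR**3, not by 3 * 25%
or_three_hours = math.exp(3 * beta_hours)

print(f"OR = {odds_ratio:.2f}, change in odds = {percent_change:.0f}%")
print(f"OR for three additional hours = {or_three_hours:.3f}")
```

Note that these percentages describe changes in odds, not in probability; the change in probability depends on where on the sigmoid curve the observation sits.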
95% Confidence Intervals for Odds Ratios
Odds ratios are always reported with 95% confidence intervals (CIs). The CI tells you the range within which the true population odds ratio likely falls, with 95% confidence. If the CI does not include 1.0, the predictor is statistically significant at α = 0.05. A CI that crosses 1.0 (for example, 0.85 to 1.45) suggests no statistically significant effect. The width of the CI reflects precision: narrow intervals indicate high precision; wide intervals indicate uncertainty. For a deeper treatment of confidence intervals, see the guide on confidence intervals.
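A Wald-type 95% CI for an odds ratio is obtained by building the interval on the log-odds scale and exponentiating both ends. The coefficient and standard error below are made-up values, and `odds_ratio_ci` is a hypothetical helper name.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio with a Wald 95% CI: exponentiate beta and beta +/- z * se."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical coefficient and standard error for a continuous predictor
odds_ratio, ci_low, ci_high = odds_ratio_ci(beta=0.077, se=0.019)

# The predictor is significant at alpha = 0.05 if the interval excludes 1.0
significant = not (ci_low <= 1.0 <= ci_high)

print(f"OR = {odds_ratio:.2f}, 95% CI: {ci_low:.2f} to {ci_high:.2f}, significant: {significant}")
```

Because the interval is symmetric on the log-odds scale, it is asymmetric around the odds ratio itself after exponentiation.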
Interpreting the Intercept
The intercept (β₀) in logistic regression represents the log-odds of the outcome when all predictor variables are zero. In practice, this is often not meaningful — a value of zero may be outside the observed range of the predictors, making the intercept uninterpretable on its own. It is included in the model for mathematical completeness but is rarely the focus of substantive interpretation.
Interpreting Categorical Predictors
Categorical predictors are entered into logistic regression using dummy coding (also called indicator coding). One category is designated the reference group and coded as 0. All other categories get their own binary dummy variable. The coefficient for each dummy variable represents the log-odds of the outcome for that category relative to the reference group. The corresponding odds ratio tells you how much higher (or lower) the odds of the outcome are for that category compared to the reference.
Practical Interpretation Framework
When reporting logistic regression results, follow this structure for each predictor: state the odds ratio, the 95% CI, the p-value, and a one-sentence plain-language interpretation. Example: "Higher age was significantly associated with increased risk of cardiovascular disease (OR = 1.08, 95% CI: 1.04–1.12, p < 0.001), indicating that each additional year of age was associated with 8% higher odds of the outcome." This format is standard in epidemiology, public health, and clinical research publications.
Predicted Probabilities
Sometimes odds ratios are still not intuitive enough for your audience, particularly non-specialist readers. In those cases, predicted probabilities are more useful. A predicted probability tells you the probability of the outcome for a specific combination of predictor values. For example: "A 35-year-old female with no prior history has a predicted probability of 12% for the disease." Marginal effects (including average marginal effects) are a related but distinct quantity: they describe how the predicted probability changes when a predictor changes. Most statistical software (R, Stata, Python's statsmodels) can compute predicted probabilities directly from a fitted logistic model.
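Computing a predicted probability by hand amounts to plugging predictor values into the linear predictor and applying the sigmoid. The coefficients below are entirely hypothetical (they are not fitted to any data and will not reproduce the 12% figure quoted above); they only show the mechanics.

```python
import math

# Hypothetical fitted coefficients on the log-odds scale
coefs = {"intercept": -4.0, "age": 0.05, "female": -0.4, "prior_history": 1.2}

def predicted_probability(age, female, prior_history):
    """Plug predictor values into the linear predictor, then apply the sigmoid."""
    z = (coefs["intercept"]
         + coefs["age"] * age
         + coefs["female"] * female
         + coefs["prior_history"] * prior_history)
    return 1.0 / (1.0 + math.exp(-z))

p_no_history = predicted_probability(age=35, female=1, prior_history=0)
p_with_history = predicted_probability(age=35, female=1, prior_history=1)

print(f"No prior history: {p_no_history:.3f}")
print(f"With prior history: {p_with_history:.3f}")
```

Comparing the two profiles shows why predicted probabilities communicate well: the same odds ratio for prior history translates into a concrete difference in probability for a specific person.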
Python Implementation
How to Implement Logistic Regression in Python: Step-by-Step
Python is the dominant language for machine learning and data science, and scikit-learn makes logistic regression straightforward to implement. But running the code is only part of the job. Understanding what each step does — and why — is what separates a competent analyst from someone who just copied a tutorial.
1. Import the Libraries
You need NumPy and pandas for data handling, scikit-learn for the model, and matplotlib or seaborn for visualization. If you are working in an academic or research context, statsmodels is often preferable because it produces p-values, confidence intervals, and summary tables in a format closer to statistical software like R or SPSS.
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (confusion_matrix, classification_report,
                             roc_auc_score, RocCurveDisplay)
import matplotlib.pyplot as plt
```
2. Load and Explore the Data
Before fitting any model, understand your data. Check the class distribution of the outcome variable. Imbalanced classes (e.g., 95% negative, 5% positive) require special handling. Check for missing values and decide how to handle them — imputation or deletion. Inspect continuous variables for outliers using box plots or histograms.
```python
# Load your dataset
df = pd.read_csv('your_dataset.csv')

# Check shape and types
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())

# Check outcome class distribution
print(df['outcome'].value_counts(normalize=True))
```
3. Preprocess: Encode Categoricals and Scale Continuous Features
Logistic regression does not strictly require feature scaling (unlike distance-based algorithms like KNN), but standardizing continuous variables improves convergence speed and makes coefficients comparable across predictors. Categorical variables must be dummy-coded. Use pd.get_dummies() and always drop the first dummy column to avoid perfect multicollinearity (the dummy variable trap).
```python
# Separate features and target
X = df.drop('outcome', axis=1)
y = df['outcome']

# Dummy-encode categorical columns
X = pd.get_dummies(X, drop_first=True)

# Train-test split (80/20), stratified to preserve the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize features: fit on the training set only to avoid leakage.
# Note that this scales every column, including the dummy variables.
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
```
4. Fit the Logistic Regression Model
Scikit-learn's LogisticRegression applies L2 regularization by default (controlled by the parameter C — higher C means less regularization). Set max_iter high enough that the solver converges. The solver choice matters for large datasets or multinomial problems — lbfgs works well for most standard cases.
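```python
model = LogisticRegression(C=1.0, max_iter=1000, solver='lbfgs')
model.fit(X_train_sc, y_train)

# View coefficients. Because the features were standardized, each odds ratio
# is per one-standard-deviation increase, and the default L2 penalty shrinks
# the estimates relative to an unpenalized fit.
coef_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': model.coef_[0],
    'Odds Ratio': np.exp(model.coef_[0])
})
print(coef_df.sort_values('Odds Ratio', ascending=False))
```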
5. Evaluate the Model
Generate predictions and evaluate using the confusion matrix, classification report, and AUC-ROC score. For imbalanced datasets, pay special attention to recall (sensitivity) and precision rather than accuracy alone. Accuracy can be misleadingly high when one class dominates.
```python
# Class predictions (default 0.5 threshold) and positive-class probabilities
y_pred = model.predict(X_test_sc)
y_pred_prob = model.predict_proba(X_test_sc)[:, 1]

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(f'AUC-ROC: {roc_auc_score(y_test, y_pred_prob):.4f}')

# Plot ROC curve
RocCurveDisplay.from_predictions(y_test, y_pred_prob)
plt.title('ROC Curve: Logistic Regression')
plt.show()
```
Use statsmodels for Statistical Reporting
If your assignment or paper requires formal statistical output (p-values, confidence intervals, likelihood ratio tests), use statsmodels rather than scikit-learn. The statsmodels.formula.api.logit() function produces output in the format expected by academic journals and university assignments, including coefficient standard errors, z-statistics, and 95% CIs for the coefficients on the log-odds scale. Exponentiate the coefficients and the interval bounds to obtain odds ratios and their CIs.
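A minimal sketch of that workflow follows, using synthetic data so that it runs on its own. The column names (`hours`, `passed`), the data-generating coefficients, and the sample size are assumptions for illustration; substitute your own DataFrame and formula.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: probability of passing rises with hours studied (assumed relationship)
rng = np.random.default_rng(42)
n = 500
hours = rng.uniform(0, 10, size=n)
true_log_odds = -3.0 + 0.8 * hours
passed = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_log_odds)))
df = pd.DataFrame({"hours": hours, "passed": passed})

# Fit an unpenalized logistic regression by maximum likelihood
result = smf.logit("passed ~ hours", data=df).fit(disp=0)

# Coefficients and CIs are on the log-odds scale; exponentiate to get odds ratios
ci = result.conf_int()
ci.columns = ["2.5%", "97.5%"]
table = np.exp(pd.concat([result.params.rename("OR"), ci], axis=1))
table["p_value"] = result.pvalues
print(table)
```

Calling `result.summary()` prints the full regression table (coefficients, standard errors, z-statistics, and log-likelihood) if that is the format your report requires.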
Model Evaluation
Evaluating Logistic Regression: Metrics, ROC Curves, and Goodness of Fit
A logistic regression model is only as useful as your ability to evaluate it honestly. Choosing the right evaluation metrics is not just good practice — it is a critical part of the analysis. The wrong metric can give a false impression of model quality, especially when your data has imbalanced classes.
The Confusion Matrix
The confusion matrix is the starting point for evaluating any classification model, including logistic regression. It tabulates predictions against actual outcomes across four categories:
| Category | Abbreviation | Meaning | Example |
|---|---|---|---|
| True Positive | TP | Model predicted positive; outcome was positive | Model predicted disease; patient has disease |
| True Negative | TN | Model predicted negative; outcome was negative | Model predicted no disease; patient is healthy |
| False Positive | FP | Model predicted positive; outcome was negative (Type I error) | Model predicted disease; patient is actually healthy |
| False Negative | FN | Model predicted negative; outcome was positive (Type II error) | Model predicted no disease; patient actually has disease |
From the confusion matrix, all other classification metrics are derived. Understanding Type I and Type II errors is foundational for interpreting the confusion matrix correctly and for making threshold decisions.
Key Performance Metrics
Accuracy is the proportion of all predictions that are correct: (TP + TN) / (TP + TN + FP + FN). It is intuitive but misleading on imbalanced datasets. A model that always predicts the majority class achieves high accuracy while being useless.
Precision (also called positive predictive value) is TP / (TP + FP). It answers: of all the cases the model flagged as positive, what fraction actually are positive? High precision is critical when false positives are costly — for example, in spam detection, you do not want legitimate emails flagged as spam.
Recall (also called sensitivity or true positive rate) is TP / (TP + FN). It answers: of all the actual positive cases, what fraction did the model catch? High recall is critical when false negatives are costly — in disease screening, missing a true case can be life-threatening.
The F1-score is the harmonic mean of precision and recall: 2 × (Precision × Recall) / (Precision + Recall). It is the standard single-number metric for imbalanced classification problems because it balances both precision and recall. See the full treatment of statistical power and effect sizes for the theoretical framework that connects these metrics to research design decisions.
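The four metrics above follow directly from the confusion-matrix counts. The sketch below uses invented counts for a screening scenario with 1,000 patients, 50 of whom have the disease; the function name is hypothetical, and it does not guard against zero denominators.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts.
    Assumes no denominator is zero."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts: 50 true cases among 1,000 patients
metrics = classification_metrics(tp=30, tn=930, fp=20, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")

# Baseline for comparison: always predicting "no disease" on the same data
always_negative_accuracy = (0 + 950) / 1000
print(f"always-negative accuracy: {always_negative_accuracy:.2f}")
```

The model's accuracy barely beats the always-negative baseline, while recall and precision reveal that it misses 40% of true cases. This is the imbalanced-data problem in miniature.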
The ROC Curve and AUC
The Receiver Operating Characteristic (ROC) curve plots the true positive rate (recall) against the false positive rate at every possible classification threshold from 0 to 1. It visualizes how the model's discrimination ability changes as the threshold is varied. A model that performs no better than random chance produces a diagonal line from (0,0) to (1,1). A perfect model reaches the top-left corner — 100% recall with 0% false positive rate.
The AUC (Area Under the ROC Curve) collapses the ROC curve into a single number between 0 and 1. An AUC of 0.5 means the model is no better than random; an AUC of 1.0 means perfect discrimination. Conventionally: AUC 0.7–0.8 is acceptable, 0.8–0.9 is excellent, above 0.9 is outstanding. AUC is the standard metric for comparing logistic regression models in clinical research and data science competitions.
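One useful way to understand AUC is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way by comparing every positive-negative pair; the function name and the four-observation example are illustrative. In practice, `roc_auc_score` from scikit-learn computes the same quantity more efficiently.

```python
import numpy as np

def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # every positive score minus every negative score
    correct = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return correct / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_by_pairs(y_true, scores)
print(auc)  # 3 of the 4 positive-negative pairs are ordered correctly
```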
Goodness-of-Fit Tests
Beyond prediction metrics, logistic regression requires goodness-of-fit assessment — does the model adequately describe the data? The Hosmer-Lemeshow test is the standard. It groups observations into deciles of predicted probability and compares observed to expected counts within each group. A non-significant p-value (p > 0.05) suggests adequate fit. The test is not without limitations — it is sensitive to sample size and the choice of groups — but it remains the most widely used goodness-of-fit diagnostic for logistic models.
Pseudo-R² statistics (McFadden's R², Nagelkerke's R²) are sometimes reported as analogs to the R² in linear regression. They measure the improvement in log-likelihood from the null model (intercept only) to the fitted model. They should not be interpreted the same way as R² in linear regression — they are not the proportion of variance explained. McFadden's R² values of 0.2 to 0.4 are considered excellent fit in logistic models, whereas such values would indicate poor fit in linear regression. For model selection and information criterion approaches, see the guide on AIC and BIC in statistical modeling.
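McFadden's R² is defined as 1 − (log-likelihood of the fitted model / log-likelihood of the intercept-only model). A small sketch, using made-up predicted probabilities and hypothetical function names:

```python
import numpy as np

def log_likelihood(y, p):
    """Bernoulli log-likelihood of outcomes y under predicted probabilities p."""
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def mcfadden_r2(y, p_model):
    y = np.asarray(y, dtype=float)
    p_model = np.asarray(p_model, dtype=float)
    p_null = np.full_like(y, y.mean())   # intercept-only model predicts the base rate
    return 1.0 - log_likelihood(y, p_model) / log_likelihood(y, p_null)

y = [0, 0, 1, 1]
r2_good = mcfadden_r2(y, [0.2, 0.2, 0.8, 0.8])   # model separates the classes fairly well
r2_null = mcfadden_r2(y, [0.5, 0.5, 0.5, 0.5])   # no better than the base rate

print(r2_good, r2_null)
```

A model that predicts only the base rate scores 0, and the value approaches 1 as predicted probabilities approach the observed outcomes.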
Cross-Validation
A single train-test split can give unreliable performance estimates, especially with small datasets. K-fold cross-validation divides the data into k equal subsets, trains the model on k−1 subsets, and evaluates on the remaining subset, rotating through all k folds. The average performance across folds is a more stable estimate of real-world model performance. The full methodology for this and other resampling approaches is covered in the guide on cross-validation and bootstrapping.
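A minimal sketch of 5-fold cross-validation with scikit-learn follows. The data are synthetic (generated by `make_classification`), and the sketch uses `StratifiedKFold` rather than plain k-fold so that each fold keeps roughly the same class balance; that substitution is a choice made here, not something the text above prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Five folds, shuffled once with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

# One AUC per fold: train on four folds, score on the held-out fold
fold_auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print("AUC per fold:", fold_auc.round(3))
print("Mean AUC:", round(float(fold_auc.mean()), 3))
```

The spread of the per-fold values is worth reporting alongside the mean, since it shows how much a single split could have misled.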
Real-World Applications
Real-World Applications of Logistic Regression
Logistic regression is not just a classroom exercise. It underpins decisions in healthcare, finance, criminal justice, marketing, and public policy. Understanding where and why it is used — and what makes each application unique — is the mark of a student who genuinely understands the model, not just its mathematics.
Healthcare and Epidemiology
Logistic regression is among the most widely used analytical methods in clinical and epidemiological research. Clinicians and epidemiologists use it to estimate the probability of disease given a set of risk factors, to identify independent predictors of adverse outcomes after adjusting for confounders, and to develop diagnostic prediction models. The Framingham Heart Study, one of the longest-running cardiovascular cohort studies in the United States, was an early and influential user: its investigators applied multiple logistic models to coronary heart disease risk in the 1960s, although the later 10-year Framingham risk scores were built with survival (Cox) models. Most adjusted odds ratios you encounter in clinical journal articles come from a logistic regression model. According to a review published in the American Journal of Epidemiology, logistic regression remains the analytic method of choice for binary outcomes in observational health studies.
Credit Risk and Finance
Banks and financial institutions have used logistic regression to model credit risk for decades. The classic application is predicting whether a loan applicant will default. The model takes predictor variables (credit score, income, debt-to-income ratio, employment history, loan amount) and estimates the probability of default. Financial regulators in the United States and United Kingdom expect lenders to be able to explain credit decisions, which is one reason logistic regression has remained dominant in this field despite the availability of more complex machine learning models. Bank capital frameworks such as Basel III likewise expect credit risk models to be validated and well documented, which fits naturally with logistic regression's transparent coefficient structure.
Spam Detection and Natural Language Processing
Email spam filtering was one of the earliest machine learning applications, and logistic regression was among the first algorithms used at scale. The model takes features extracted from email text — frequency of certain words, sender reputation, presence of links — and predicts the probability that the email is spam. Google's early spam detection system used logistic regression extensively. While modern production systems use gradient boosting and deep learning, logistic regression remains a strong baseline and is still used in text classification tasks where interpretability matters. For data science students, implementing a text-based logistic classifier is a standard project in natural language processing courses.
Social Science Research
Survey-based social science research frequently produces binary outcomes — did the respondent vote?, do they support a policy?, are they employed? Logistic regression allows researchers to estimate the independent effect of demographic, economic, and attitudinal variables on these binary outcomes while controlling for confounders. The American National Election Studies (ANES) datasets, maintained by the University of Michigan and Stanford University, are regularly analyzed using logistic regression. The same framework applies in sociology, criminology, and public administration. For students handling survey and administrative data, the guide on qualitative versus quantitative data provides helpful context for understanding when logistic regression is the right choice.
Marketing and Customer Analytics
Logistic regression predicts customer churn, purchase conversion, and campaign response rates. A marketing team at a subscription service might build a churn model using features like usage frequency, subscription tier, customer tenure, and support ticket volume, predicting which customers are most likely to cancel in the next 30 days. The odds ratios from the model tell the team which factors most strongly predict churn — actionable intelligence for targeted retention campaigns. For students in business programs, the guide on marketing strategy provides context for how analytical models like logistic regression feed into broader business decision-making.
Criminal Justice and Recidivism Prediction
Logistic regression models are used in the United States criminal justice system to assess the risk of recidivism, the probability that a released offender will commit another crime. Risk assessment instruments such as COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), used by courts and parole boards, are proprietary, but tools of this kind are generally built on regression-style scoring models. These applications are scientifically powerful but ethically complex: ProPublica's 2016 investigation found evidence of racial disparities in COMPAS risk scores, igniting a major public debate about algorithmic fairness in criminal justice. This remains an important area of critical inquiry for students in law, sociology, and data ethics courses.
Logistic vs. Linear Regression
Logistic Regression vs. Linear Regression: Key Differences
The comparison between logistic regression and linear regression is one of the most tested concepts in statistics courses and data science interviews. They share a linear model structure but differ fundamentally in their output, assumptions, cost functions, and appropriate applications. Understanding these differences precisely is essential.
Logistic Regression
- Output: Probability between 0 and 1
- Task: Classification
- Dependent variable: Categorical (binary, nominal, ordinal)
- Link function: Logit (its inverse is the sigmoid)
- Error distribution: Binomial
- Cost function: Log-loss (binary cross-entropy)
- Estimation: Maximum likelihood
- Goodness of fit: AUC, Hosmer-Lemeshow, pseudo-R²
- Coefficient interpretation: Change in log-odds (OR = e^β)
Linear Regression
- Output: Any continuous value (unbounded)
- Task: Prediction/regression
- Dependent variable: Continuous
- Link function: Identity (no transformation)
- Error distribution: Normal (Gaussian)
- Cost function: Mean squared error
- Estimation: Ordinary least squares (closed form)
- Goodness of fit: R², RMSE, F-statistic
- Coefficient interpretation: Unit change in outcome per unit change in predictor
Why Not Use Linear Regression for Classification?
This is a question that genuinely matters. Linear regression can technically be applied to a binary (0/1) outcome — this is called the linear probability model (LPM). In economics and some social sciences, it is sometimes used as a first approximation. But it has serious problems. First, it can produce predicted probabilities outside [0, 1] — negative probabilities and probabilities above 1 are mathematically meaningless. Second, the error terms are necessarily heteroscedastic (violating a key linear regression assumption) when the outcome is binary. Third, the relationship between a predictor and a binary outcome is almost never linear across the full range of the predictor. Logistic regression solves all three problems simultaneously, which is why it replaced the LPM in most fields. For the full theoretical background on linear regression, see the guide on simple linear regression alongside the coverage of polynomial regression for non-linear extensions.
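The first problem is easy to reproduce. The sketch below fits a straight line by ordinary least squares to a tiny made-up pass/fail dataset and evaluates it at the ends of the predictor range, then shows that the sigmoid of any linear predictor stays strictly between 0 and 1. The data and variable names are illustrative assumptions.

```python
import math

# Toy data (made up): hours studied and pass (1) / fail (0)
hours = list(range(11))  # 0, 1, ..., 10
passed = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Ordinary least squares fit of passed ~ hours (the linear probability model)
n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(passed) / n
sxx = sum((x - mean_x) ** 2 for x in hours)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, passed))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

lpm_at_0 = intercept + slope * 0
lpm_at_10 = intercept + slope * 10
print(f"LPM 'probability' at 0 hours:  {lpm_at_0:.3f}")
print(f"LPM 'probability' at 10 hours: {lpm_at_10:.3f}")


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# However extreme the linear predictor, the sigmoid stays inside (0, 1)
for z in (-6.0, 0.0, 6.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.4f}")
```

On this toy data the fitted line dips below 0 at the low end and rises above 1 at the high end, which is exactly the out-of-range behavior described above.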
Logistic Regression vs. Other Classification Algorithms
Beyond linear regression, logistic regression is often compared to other classifiers in the machine learning ecosystem. Against decision trees, logistic regression is more stable (less prone to overfitting without pruning) and more interpretable, but it cannot capture non-linear boundaries without explicit feature engineering. Against random forests and gradient boosting, logistic regression typically has lower predictive accuracy on complex datasets but produces directly interpretable coefficients — a decisive advantage in regulated industries. Against support vector machines (SVMs), logistic regression naturally outputs probabilities, which SVMs do not without additional calibration. The choice between these algorithms should be driven by the requirements of the specific problem: interpretability, prediction accuracy, computational resources, and regulatory constraints.
Regularization & Overfitting
Regularization in Logistic Regression: L1, L2, and ElasticNet
Overfitting is a genuine risk in logistic regression, especially with many predictor variables relative to the number of observations. An overfitted model learns the noise in the training data and performs poorly on new data. Regularization is the standard technique for controlling overfitting by adding a penalty to the cost function that discourages large coefficient values.
L2 Regularization (Ridge)
L2 regularization — also called Ridge regression in the linear regression context — adds the sum of squared coefficients multiplied by a penalty parameter λ to the log-loss cost function. This shrinks all coefficients toward zero but does not set any to exactly zero. The result is a model with smaller, more stable coefficients that generalizes better to new data. In scikit-learn's LogisticRegression, L2 is the default (controlled by the parameter C = 1/λ — smaller C means more regularization).
L1 Regularization (Lasso)
L1 regularization — Lasso — adds the sum of absolute values of coefficients to the cost function. Unlike L2, L1 can shrink some coefficients to exactly zero, effectively performing automatic feature selection. This makes it especially useful when you have many predictor variables and suspect that most are irrelevant — L1 will discard them. In Python, use LogisticRegression(penalty='l1', solver='liblinear').
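To see the difference between the two penalties, the sketch below fits L2-penalized models at a weak and a strong setting of C, plus one L1-penalized model, on synthetic data in which only 3 of 20 features carry signal. The dataset, the particular C values, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 3 of them informative (illustration only)
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=3, n_redundant=0, random_state=0
)
X = StandardScaler().fit_transform(X)

# L2 is scikit-learn's default penalty; smaller C means a stronger penalty
l2_weak = LogisticRegression(C=100.0, max_iter=5000).fit(X, y)
l2_strong = LogisticRegression(C=0.01, max_iter=5000).fit(X, y)

# L1 penalty, using the solver named in the text
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)

norm_weak = float(np.linalg.norm(l2_weak.coef_))
norm_strong = float(np.linalg.norm(l2_strong.coef_))
n_zero_l1 = int(np.sum(l1.coef_ == 0))
n_zero_l2 = int(np.sum(l2_strong.coef_ == 0))

print(f"L2 coefficient norm, C=100:  {norm_weak:.3f}")
print(f"L2 coefficient norm, C=0.01: {norm_strong:.3f}")
print(f"Coefficients exactly zero: L1 = {n_zero_l1}, strong L2 = {n_zero_l2}")
```

The L2 norm shrinks as C decreases, while only the L1 fit is expected to set coefficients to exactly zero. How many it zeroes depends on C and on the data.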
ElasticNet
ElasticNet combines L1 and L2 penalties, controlled by a mixing parameter that determines how much of each is applied. It retains the feature-selection property of L1 while gaining the stability of L2, which matters when groups of correlated features are present. ElasticNet is a practical option to try when neither L1 nor L2 alone performs well. For the theoretical treatment of these regularization methods in the regression context, see the comprehensive guide on Ridge and Lasso regularization.
Hyperparameter tuning tip: The regularization strength in logistic regression (parameter C in scikit-learn) should be selected using cross-validated grid search rather than by hand.
GridSearchCV or RandomizedSearchCV from scikit-learn can test a range of C values across multiple folds and select the value that produces the best average validation performance. Always tune C on the training set and evaluate the final tuned model on a held-out test set that was not used during tuning.
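The workflow in this tip can be sketched as follows. The data are synthetic, the grid of C values is an arbitrary illustrative choice, and the scaler is placed inside the pipeline so that it is re-fit within each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(
    n_samples=500, n_features=15, n_informative=5, random_state=42
)

# Hold out a test set that the tuning procedure never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100]}

# 5-fold cross-validated search over C, run on the training data only
search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

best_C = search.best_params_["logisticregression__C"]
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])

print("Best C:", best_C)
print("Cross-validated AUC at best C:", round(search.best_score_, 3))
print("Held-out test AUC:", round(test_auc, 3))
```

`GridSearchCV` refits the best configuration on the full training set by default, so `search` can be used directly for the final test-set evaluation.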
Common Mistakes
Common Logistic Regression Mistakes in Assignments and Research
Logistic regression is conceptually approachable, but getting the analysis right — from the model setup through to the interpretation — involves numerous judgment calls. The following errors appear regularly in student assignments, research papers, and data science projects. Knowing them in advance will save you marks and credibility.
Mistake 1: Reporting Accuracy on Imbalanced Data
If 90% of your outcome observations are class 0, a model that always predicts class 0 achieves 90% accuracy while being completely useless. Accuracy is not an appropriate primary metric for imbalanced classification. Report AUC-ROC, F1-score, precision, and recall instead. If the dataset is severely imbalanced, consider techniques like SMOTE (Synthetic Minority Oversampling Technique), cost-sensitive learning, or adjusting the classification threshold before concluding the model performs well or poorly.
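The arithmetic behind this mistake takes only a few lines. The sketch below scores a "model" that always predicts class 0 on a made-up dataset with a 90/10 class split.

```python
# Made-up labels: 90 negatives, 10 positives
y_true = [0] * 90 + [1] * 10
# A useless "model" that always predicts the majority class
y_pred = [0] * 100

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)  # share of actual positives that were found

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall:   {recall:.2f}")
```

Accuracy looks respectable while recall shows that not a single positive case was identified, which is why recall, precision, F1, and AUC belong in the report.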
Mistake 2: Forgetting to Scale Features
While logistic regression can technically run without scaling, unscaled features of vastly different magnitudes (income in dollars vs. age in years) make coefficients non-comparable and can slow convergence. Standardize continuous predictors before fitting, especially when using regularization: the penalty is applied to every coefficient equally, so a predictor with a small numeric range, such as age in years, needs a larger coefficient and is therefore shrunk harder than a predictor with a large numeric range, such as income in dollars, purely as an artifact of units.
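A minimal sketch of standardizing inside a scikit-learn pipeline is shown below. The income and age data are simulated so that both predictors have the same true effect per standard deviation; the simulation settings and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
income = rng.normal(60_000, 15_000, n)  # dollars
age = rng.normal(40, 10, n)             # years

# Simulated outcome: equal true effect per standard deviation of each predictor
z = 0.8 * (income - 60_000) / 15_000 + 0.8 * (age - 40) / 10
y = (rng.random(n) < 1 / (1 + np.exp(-z))).astype(int)
X = np.column_stack([income, age])

# Scaling lives inside the pipeline, so it is learned from training data only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)

coef_std = pipe.named_steps["logisticregression"].coef_[0]
print("Per-standard-deviation coefficients (income, age):", coef_std.round(2))
```

After standardization each coefficient is a change in log-odds per one standard deviation of its predictor, so the two can be compared directly.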
Mistake 3: Not Checking for Multicollinearity
Including two highly correlated predictors in the same model inflates standard errors and destabilizes coefficient estimates. Run a VIF analysis before interpreting any multiple logistic regression. This is one of the assumptions students most frequently skip because it requires an extra step. The output of any published logistic regression should include a statement about multicollinearity diagnostics — if it does not, that is a red flag.
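A VIF check can be sketched in plain NumPy: regress each predictor on the others and compute 1 / (1 − R²). The helper `vif` and the simulated predictors below are illustrative assumptions; statistical packages provide equivalent functions. Commonly cited rules of thumb flag VIF values above roughly 5 or 10, though the cutoff is a convention rather than a hard rule.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (rows = observations).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on all the other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)


# Simulated predictors: x2 is nearly a copy of x1, x3 is unrelated
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)

vifs = vif(np.column_stack([x1, x2, x3]))
print("VIF (x1, x2, x3):", vifs.round(1))
```

The two near-duplicate predictors show very large VIFs while the unrelated one stays close to 1.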
Mistake 4: Interpreting Coefficients as Probabilities
The raw coefficients from logistic regression are in log-odds units, not probability units. Saying "a one-unit increase in X increases the probability of the outcome by 0.45" is incorrect if 0.45 is the raw coefficient. The correct statement is "a one-unit increase in X increases the log-odds of the outcome by 0.45, corresponding to an odds ratio of 1.57 (e^0.45)." Converting to predicted probabilities requires specifying the values of all other predictors — the effect on probability is not constant across the predictor's range.
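The following sketch works through the 0.45 example: it converts the coefficient to an odds ratio, then shows how the same one-unit increase changes the probability by different amounts depending on the starting probability. The baseline probabilities are arbitrary illustrative values.

```python
import math

beta = 0.45  # raw coefficient on the log-odds scale
odds_ratio = math.exp(beta)
print(f"Odds ratio: {odds_ratio:.2f}")


def shift_probability(p_baseline, beta):
    """Probability after adding beta to the log-odds of p_baseline."""
    log_odds = math.log(p_baseline / (1 - p_baseline)) + beta
    return 1 / (1 + math.exp(-log_odds))


# The same coefficient moves the probability by different amounts
for p0 in (0.10, 0.50, 0.90):
    p1 = shift_probability(p0, beta)
    print(f"baseline {p0:.2f} -> {p1:.3f} (change {p1 - p0:+.3f})")
```

The odds are multiplied by the same factor at every baseline, but the change in probability is largest near 0.5 and smaller toward either extreme.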
Mistake 5: Using Logistic Regression with Too Few Events
Running a multiple logistic regression with 5 predictors on a dataset with only 30 positive events (EPV = 6) will produce overfit, unreliable coefficients. The model will appear to fit the training data well but will fail badly on new data. Always compute your EPV before running the model. If EPV is below 10, reduce the number of predictors, collect more data, or use penalized regression (Firth logistic regression or ridge-penalized maximum likelihood) designed specifically for small-sample logistic modeling.
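The EPV check is simple arithmetic, sketched below with two hypothetical helpers. The first reproduces the example above (the count of 170 non-events is an added assumption, needed only so that the 30 events are the less common category); the second reverses the calculation to give the smallest total sample that reaches a target EPV for a given event rate.

```python
def events_per_variable(n_events, n_nonevents, n_predictors):
    """EPV, counting the less common outcome category as the 'events'."""
    return min(n_events, n_nonevents) / n_predictors


def min_total_sample(n_predictors, event_rate_percent, target_epv=10):
    """Smallest total N that reaches target_epv at the given event rate.

    event_rate_percent is a whole-number percentage (e.g. 40 for 40%),
    which keeps the arithmetic in integers.
    """
    required_events = target_epv * n_predictors
    # Integer ceiling division of required_events / (event_rate_percent / 100)
    return -(-(required_events * 100) // event_rate_percent)


epv = events_per_variable(n_events=30, n_nonevents=170, n_predictors=5)
print("EPV:", epv)

needed = min_total_sample(n_predictors=5, event_rate_percent=40)
print("Minimum total sample for EPV of 10:", needed)
```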
⚠️ Data leakage: A logistic regression model that appears to achieve near-perfect AUC may be suffering from data leakage, where information from the test set inadvertently entered the training process, or where a predictor variable contains information that would not be available at prediction time in the real world. Always build the model as if you are in the past making predictions about the future. Leakage is a common source of inflated performance estimates in student projects and published papers alike.
Mistake 6: Ignoring the Proportional Odds Assumption in Ordinal Models
When using ordinal logistic regression, the proportional odds assumption — that the odds ratio for each predictor is the same across all outcome thresholds — must be tested. The standard test is the Brant test or a score test for proportional odds. If the assumption is violated, alternatives include the partial proportional odds model, generalized ordered logit, or multinomial logistic regression (at the cost of ignoring the ordering). Reporting ordinal logistic regression without testing this assumption is a methodological error in formal research contexts.
R Implementation
Logistic Regression in R: A Quick Reference
R remains the preferred language for statistical analysis in academia, particularly in epidemiology, public health, political science, and psychology. Many university statistics courses and journal submission requirements are built around R output. Here is the essential code for fitting and interpreting a logistic regression model in R.
```r
# Fit logistic regression
model <- glm(outcome ~ predictor1 + predictor2 + predictor3,
             data = your_data,
             family = binomial(link = "logit"))

# Summary with coefficients, z-stats, p-values
summary(model)

# Odds ratios with 95% CI
library(MASS)
exp(cbind(OR = coef(model), confint(model)))

# Hosmer-Lemeshow goodness of fit
library(ResourceSelection)
hoslem.test(model$y, fitted(model), g = 10)

# AUC-ROC
library(pROC)
roc_obj <- roc(your_data$outcome, fitted(model))
auc(roc_obj)
plot(roc_obj, col = "#2563EB", lwd = 2)
```
R's glm() function with family = binomial is the standard logistic regression implementation. The output includes coefficients on the log-odds scale, standard errors, z-statistics, and p-values. The exp(coef()) call converts coefficients to odds ratios. The confint() function computes profile likelihood confidence intervals, which are preferable to Wald CIs for logistic regression. For students dealing with survey or clustered data, the survey package provides design-adjusted logistic regression that accounts for complex sampling.
One key R-specific nuance: the default contrasts in R's glm() function use treatment coding (dummy coding) for factors, where the first level alphabetically becomes the reference category. Always check which category is the reference and relevel your factor if needed using relevel(factor_var, ref = "your_reference") before fitting the model. Getting the reference category wrong is a small but frequent source of confusion in results interpretation.
Advanced Topics
Advanced Logistic Regression Topics for Upper-Level Courses
Students in upper-division statistics, data science, epidemiology, or machine learning courses will encounter several extensions and complications of the basic logistic regression framework. These topics appear in graduate-level assignments, capstone projects, and journal articles.
Interaction Terms
Interaction terms in logistic regression test whether the effect of one predictor on the outcome depends on the level of another predictor. For example: does the effect of a medication on disease risk differ between males and females? To test this, you include the product of the two predictors (medication × sex) as an additional term in the model. A significant interaction coefficient indicates effect modification. Interpreting interactions in logistic regression is considerably more complex than in linear regression — the odds ratios for main effects change meaning when interactions are present, and marginal effects must be calculated at specific values of the interacting predictors.
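The point about main-effect odds ratios changing meaning can be seen with a little arithmetic. The coefficients below are hypothetical values on the log-odds scale, not output from any real model, and sex is assumed to be coded female = 1, male = 0.

```python
import math

# Hypothetical fitted coefficients on the log-odds scale (illustration only)
b_med = -0.60          # medication effect when female = 0, i.e. among males
b_interaction = 0.45   # extra medication effect when female = 1

# With an interaction in the model, the medication OR depends on sex
or_med_males = math.exp(b_med)
or_med_females = math.exp(b_med + b_interaction)

# The exponentiated interaction coefficient is a ratio of odds ratios
ratio_of_ors = or_med_females / or_med_males

print(f"Medication OR among males:   {or_med_males:.2f}")
print(f"Medication OR among females: {or_med_females:.2f}")
print(f"Ratio of ORs:                {ratio_of_ors:.2f}")
```

The "main effect" of medication is therefore the effect in the reference group only, and the exponentiated interaction term is not itself an odds ratio for any single group.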
Mixed-Effects (Multilevel) Logistic Regression
When your data has a clustered or hierarchical structure (students nested within schools, patients nested within hospitals, observations nested within individuals), standard logistic regression violates the independence assumption. Mixed-effects logistic regression (also called multilevel logistic regression) handles this by adding random effects that capture the correlation within clusters. In R, the lme4 package's glmer() function fits mixed-effects logistic models. In Python, statsmodels' MixedLM handles only linear mixed models; for a binary outcome, statsmodels provides BinomialBayesMixedGLM (a Bayesian approximation), and pymer4 offers an interface to lme4. These models are common in educational research, clinical trials, and longitudinal health studies.
Conditional Logistic Regression
Conditional logistic regression is used for matched case-control studies — a research design common in epidemiology where each case (a person with the disease) is matched to one or more controls (persons without the disease) on characteristics like age and sex. Standard logistic regression cannot account for the matching. Conditional logistic regression conditions on the matched sets, eliminating confounding from the matching variables. It is fit using clogit() in R's survival package or statsmodels.ConditionalLogit in Python. For comprehensive coverage of survival analysis and time-to-event outcomes in the same research contexts, see the Kaplan-Meier and Cox proportional hazards guide.
Rare Events Logistic Regression
Standard maximum likelihood logistic regression underestimates the probability of rare events, meaning outcomes that occur in a small fraction of the sample. Gary King (Harvard University) and Langche Zeng developed a bias-corrected method called rare events logistic regression (ReLogit), implemented in the R package Zelig. An alternative is Firth's penalized likelihood (available in the logistf R package), which is widely used for separation problems and also reduces small-sample bias. These methods matter in political science research (rare conflict events), epidemiology (rare diseases), and any context where the outcome occurs in fewer than 5% of observations.
Penalized Logistic Regression for High-Dimensional Data
When the number of predictor variables exceeds or approaches the number of observations (common in genomics, neuroimaging, and text analysis), standard maximum likelihood logistic regression breaks down: the estimates become unstable, and once predictors outnumber observations they are no longer uniquely defined. Variable selection becomes critical. LASSO logistic regression (via the glmnet package in R or sklearn in Python) performs automatic variable selection through L1 penalization. The order in which variables drop out of the model as λ increases can itself reveal which predictors are most strongly associated with the outcome. Cross-validated LASSO is now a standard approach in high-dimensional biomedical research. See the comprehensive treatment of factor analysis and dimensionality reduction for complementary variable reduction strategies.
Academic Writing & Reporting
How to Report Logistic Regression Results in Academic Papers and Assignments
Knowing how to run logistic regression is not the same as knowing how to communicate it. Academic journals, graduate theses, and university assignments all have reporting conventions that must be followed. Sloppy reporting is one of the most common ways strong analysis is undermined in student work.
The Methods Section
In your methods section, state clearly: (1) that you used logistic regression, (2) whether it was binary, multinomial, or ordinal, (3) which software and version you used, (4) which predictors were included and why, (5) how you handled missing data, (6) whether and how you checked the model assumptions, and (7) which evaluation metrics you report. Do not simply say "we used logistic regression." That is insufficient — readers need enough detail to reproduce your analysis.
The Results Section
Report coefficients as odds ratios with 95% confidence intervals and p-values, organized in a clear table. Include the sample size, the number of events per predictor variable, and the model's overall fit statistics (AUC or Hosmer-Lemeshow p-value). A good results table for logistic regression looks like this:
| Predictor | Odds Ratio | 95% CI | p-value |
|---|---|---|---|
| Age (per year) | 1.08 | 1.04 – 1.12 | <0.001 |
| Female sex (ref: male) | 0.72 | 0.55 – 0.93 | 0.012 |
| Hypertension (yes vs. no) | 2.34 | 1.76 – 3.11 | <0.001 |
| Smoking history (yes vs. no) | 1.88 | 1.42 – 2.49 | <0.001 |
| BMI (per unit) | 1.05 | 1.01 – 1.09 | 0.022 |
Always specify what "ref" (reference group) is for categorical predictors. Always state the total N and the number of outcome events. Include overall model performance (AUC, pseudo-R²). For student assignments, match the reporting format to what your rubric or course guidelines specify — some courses require SPSS output tables, others require APA format, others want R or Python output attached as an appendix. The guide on research paper writing covers broader academic writing conventions that apply across analytical disciplines. For correct statistical reporting style and grammar, the guide on common grammar mistakes in student essays is a useful parallel reference.
Citing the Statistical Method
Many journals require citation of the logistic regression method itself, particularly for extensions like Firth's method or LASSO. The foundational Cox (1958) paper in the Journal of the Royal Statistical Society, Series B is a standard reference for binary logistic regression. For rare events logistic regression, cite King and Zeng (2001) in Political Analysis. For LASSO, cite Tibshirani (1996) in the Journal of the Royal Statistical Society, Series B. For mixed-effects logistic regression in R, cite Bates et al. (2015) describing the lme4 package. Accurate citation of the statistical methodology is taken seriously in journal review and should not be neglected. A thorough guide to writing literature reviews covers the broader citation and source-management skills that apply in statistical research papers as well.
Frequently Asked Questions
Frequently Asked Questions About Logistic Regression
What is logistic regression used for?
Logistic regression is used for binary and multi-class classification problems — predicting whether an outcome belongs to one category or another. Common applications include email spam detection, disease diagnosis (does this patient have diabetes?), credit risk scoring (will this applicant default on a loan?), and customer churn prediction. It is one of the most widely used algorithms in both academic research and industry precisely because its outputs — odds ratios and predicted probabilities — are interpretable and statistically grounded.
What is the difference between logistic regression and linear regression?
Linear regression predicts a continuous numerical output — a salary, a temperature, a test score. Logistic regression predicts the probability that an observation belongs to a specific class, and that probability is always between 0 and 1. The key technical difference is the link function: logistic regression applies the sigmoid function to a linear predictor, mapping any real-valued input to a probability. The cost functions also differ: linear regression minimizes mean squared error; logistic regression minimizes log-loss via maximum likelihood estimation. Linear regression is appropriate for continuous outcomes; logistic regression is appropriate for categorical outcomes.
What is the sigmoid function in logistic regression?
The sigmoid function transforms any real-valued input into a probability between 0 and 1 using the formula σ(z) = 1 / (1 + e^(−z)). In logistic regression, z is the linear combination of the predictor variables and their coefficients. When z is large and positive, σ(z) approaches 1 (high probability of the positive class). When z is large and negative, σ(z) approaches 0. At z = 0, σ(z) = 0.5. The resulting S-shaped curve is steep in the middle and flat at the extremes, which is what makes logistic regression well suited to probability estimation and classification.
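A short sketch of the formula in code, written in a numerically stable form so that very large negative inputs do not overflow. The branch on the sign of z is an implementation choice made here and is not part of the definition.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), computed in a stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z is negative, so e^z cannot overflow
    return ez / (1.0 + ez)


for z in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(f"sigmoid({z:+.1f}) = {sigmoid(z):.4f}")
```

The two branches are algebraically identical. The curve is also symmetric: sigmoid(z) + sigmoid(−z) = 1 for any z.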
What are the assumptions of logistic regression?
The key assumptions of logistic regression are: (1) the dependent variable must be binary or categorical; (2) observations must be independent of each other; (3) continuous predictors must have a linear relationship with the log-odds of the outcome; (4) there should be little or no severe multicollinearity among predictor variables; (5) the sample should be sufficiently large — at least 10 events per predictor variable; (6) there should be no perfect separation (a predictor perfectly predicting the outcome). Notably, logistic regression does NOT require normally distributed residuals, homoscedasticity, or a linear relationship between predictors and the raw probability of the outcome.
What is an odds ratio in logistic regression?
The odds ratio (OR) measures the change in the odds of the outcome associated with a one-unit increase in a predictor variable, holding all other predictors constant. It is calculated by exponentiating the logistic regression coefficient: OR = e^β. An OR greater than 1 means the predictor increases the odds of the outcome; an OR less than 1 means it decreases the odds; an OR of exactly 1 means no association. Odds ratios are the standard unit of interpretation in logistic regression research and clinical studies — they should always be reported with 95% confidence intervals and p-values.
How do you evaluate a logistic regression model?
Logistic regression models are evaluated using several complementary metrics. The confusion matrix breaks down predictions into true positives, true negatives, false positives, and false negatives. Accuracy is useful only for balanced datasets. Precision, recall, and F1-score are more informative when classes are imbalanced. The AUC-ROC (Area Under the Receiver Operating Characteristic Curve) measures overall discrimination ability — how well the model separates classes across all thresholds — and is the standard metric for comparing logistic models. Goodness-of-fit is assessed using the Hosmer-Lemeshow test and pseudo-R² statistics. Cross-validation provides robust, generalization-focused performance estimates.
What is the difference between binary and multinomial logistic regression?
Binary logistic regression predicts between exactly two outcome classes — pass/fail, yes/no, disease/no disease. It uses the sigmoid function and produces a single set of coefficients. Multinomial logistic regression extends this to three or more unordered outcome classes — for example, predicting which political party a voter will choose (Democrat, Republican, or Independent). It uses the softmax function to produce simultaneous probabilities for each class, all summing to 1. Ordinal logistic regression is a third type that handles three or more ordered categories (e.g., low, medium, high satisfaction) using the proportional odds model.
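The relationship between the two link functions can be sketched directly: softmax turns one linear score per class into probabilities that sum to 1, and with two classes (one of them fixed at score 0 as the reference) it reduces to the sigmoid. The scores below are arbitrary illustrative numbers.

```python
import math


def softmax(scores):
    """Convert a list of class scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three-class example with made-up linear scores
probs = softmax([1.2, 0.3, -0.5])
print("Class probabilities:", [round(p, 3) for p in probs])
print("Sum:", round(sum(probs), 6))

# Two classes with the reference class fixed at 0: same as the sigmoid
z = 0.8
two_class = softmax([z, 0.0])[0]
sigmoid_z = 1.0 / (1.0 + math.exp(-z))
print("softmax vs sigmoid:", round(two_class, 6), round(sigmoid_z, 6))
```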
Can logistic regression handle multiple independent variables?
Yes. Multiple logistic regression (multivariable logistic regression) includes two or more predictor variables, both continuous and categorical, simultaneously. Each predictor gets its own coefficient, and the model estimates the probability of the outcome as a function of all predictors at once. The crucial advantage of the multiple form is that it estimates the independent effect of each predictor while statistically controlling for all others — essential for separating the contributions of correlated variables and adjusting for confounders in observational research.
What is the rule of thumb for sample size in logistic regression?
The most widely cited guideline is the "events per variable" (EPV) rule: you need at least 10 events (observations in the less common outcome category) for each predictor variable in the model. This rule was proposed by Peduzzi et al. (1996) in the Journal of Clinical Epidemiology. So if you have 5 predictors and 40% of your observations are positive outcomes (events), you need at least 50 events — meaning a total sample of at least 125. EPV below 10 risks overfitting, unstable coefficients, inflated odds ratios, and unreliable confidence intervals.
How does regularization help logistic regression?
Regularization prevents overfitting by adding a penalty to the cost function that discourages large coefficient values. L2 regularization (Ridge) shrinks all coefficients toward zero, producing a more stable model that generalizes better to new data. L1 regularization (Lasso) can shrink some coefficients to exactly zero, performing automatic feature selection — useful when many predictors are irrelevant. ElasticNet combines both. In scikit-learn, the regularization strength is controlled by the parameter C: smaller C means stronger regularization. The optimal C should be selected using cross-validated grid search rather than set manually.
