Bioinfo VCF Filtering Fix

DodaTech Updated 2026-06-26 3 min read

You will learn how to filter VCF variants by quality, depth, and annotation impact.

The Problem

VCF filtering is frequently misapplied by data scientists and Python developers, leading to runtime errors, incorrect results, or inefficient code. This quick-fix guide shows a working implementation and the common pitfalls to avoid when filtering VCF files in Python. The examples use the PyVCF package (import vcf).

The Wrong Way

The most common mistakes are filtering on too few fields and misunderstanding how a VCF record is structured: QUAL holds one value per record, while DP in the FORMAT column holds one value per sample. Here is a typical first attempt:

import vcf
def read_and_filter_vcf(path, min_qual=30, min_dp=10):
    reader = vcf.Reader(open(path))
    passed = []
    for record in reader:
        if record.QUAL and record.QUAL >= min_qual:
            for sample in record.samples:
                if sample['DP'] and sample['DP'] >= min_dp:
                    passed.append(record)
                    break
    return passed
filtered = read_and_filter_vcf('input.vcf', 50, 20)
print(f'{len(filtered)} variants pass filters')

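What happens: the script runs and prints how many variants pass the quality and depth thresholds, but the result is less reliable than it looks.

This approach falls short in several ways. The file opened with open(path) is never closed. A record is kept as soon as any one sample reaches min_dp, so poorly covered samples ride along with a single well-covered one. Looking up sample['DP'] can raise an error, rather than skipping the record, when the FORMAT column has no DP key. The thresholds are also passed positionally (50, 20), which makes it easy to pass them in the wrong order.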
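As an illustration of a more defensive filter, here is a minimal, dependency-free sketch. It parses tab-separated VCF data lines directly instead of going through PyVCF, and the names parse_sample, passes_filters, and require_all_samples are hypothetical, chosen only for this example. The sample lines are made up.

```python
def parse_sample(format_keys, sample_field):
    """Map FORMAT keys to one sample's values; '.' becomes None."""
    values = sample_field.split(':')
    return {k: (None if v == '.' else v) for k, v in zip(format_keys, values)}


def passes_filters(line, min_qual=30.0, min_dp=10, require_all_samples=True):
    """Return True if a VCF data line meets the QUAL and per-sample DP thresholds."""
    cols = line.rstrip('\n').split('\t')
    if len(cols) < 10:  # need FORMAT plus at least one sample column
        return False
    qual = cols[5]
    if qual == '.' or float(qual) < min_qual:
        return False
    format_keys = cols[8].split(':')
    depth_ok = []
    for field in cols[9:]:
        dp = parse_sample(format_keys, field).get('DP')
        # A missing DP counts as a failed sample instead of raising
        depth_ok.append(dp is not None and int(dp) >= min_dp)
    return all(depth_ok) if require_all_samples else any(depth_ok)


# Made-up data lines: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT samples...
lines = [
    "1\t100\t.\tA\tG\t60\tPASS\t.\tGT:DP\t0/1:25\t1/1:30",
    "1\t200\t.\tC\tT\t60\tPASS\t.\tGT:DP\t0/1:25\t./.:.",
    "1\t300\t.\tG\tA\t.\tPASS\t.\tGT\t0/1\t0/1",
]
kept = [l for l in lines if passes_filters(l, min_qual=50, min_dp=20)]
print(len(kept))  # 1
```

Passing the thresholds by keyword avoids swapping them, and require_all_samples makes the any-sample versus every-sample decision explicit instead of leaving it implicit in a break statement.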

The Right Way

The right approach combines quality and depth filtering with annotation impact. Continuing from the filtered list above, this snippet counts the SnpEff impact category of each ANN entry:

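from collections import defaultdict

# Count ANN entries per impact category across the filtered records
impacts = defaultdict(int)
for record in filtered:
    # A record without ANN contributes one 'UNKNOWN'
    for ann in record.INFO.get('ANN', ['']):
        # Impact is the third '|'-separated subfield of a SnpEff ANN entry
        parts = ann.split('|')
        imp = parts[2] if len(parts) > 2 else 'UNKNOWN'
        impacts[imp] += 1
print(dict(impacts))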

Example output (the actual counts depend on your input file):

{'HIGH': 5, 'MODERATE': 23, 'LOW': 45, 'MODIFIER': 120}
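Note that the loop above counts every ANN entry, and a single variant usually carries one entry per affected transcript, so the totals can exceed the number of variants. If one count per variant is wanted, a possible sketch is to keep only the most severe impact per record. The helper most_severe_impact and the sample annotations are hypothetical, and the input is a plain list of ANN strings rather than a PyVCF record.

```python
from collections import Counter

# SnpEff impact categories, most severe first
SEVERITY = ['HIGH', 'MODERATE', 'LOW', 'MODIFIER']


def most_severe_impact(ann_entries):
    """Return the most severe impact among one record's ANN entries."""
    found = set()
    for entry in ann_entries or []:
        parts = entry.split('|')
        if len(parts) > 2 and parts[2] in SEVERITY:
            found.add(parts[2])
    for level in SEVERITY:
        if level in found:
            return level
    return 'UNKNOWN'


# Made-up ANN lists, one per record; None stands for a record without ANN
records_ann = [
    ['G|missense_variant|MODERATE|GENE1', 'G|upstream_gene_variant|MODIFIER|GENE2'],
    ['T|stop_gained|HIGH|GENE3'],
    None,
]
counts = Counter(most_severe_impact(a) for a in records_ann)
print(dict(counts))  # {'MODERATE': 1, 'HIGH': 1, 'UNKNOWN': 1}
```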

Step-by-Step Fix

1. Understand the data types and shapes

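Before applying any filter, verify the data types of the fields you compare against. Most errors in VCF filtering come from type mismatches or missing values, such as a QUAL of None or an INFO key that is absent.

# Always inspect your data first (here: the first record kept by the filter)
record = filtered[0]
print(type(record.QUAL), record.QUAL)  # a number, or None when the column is '.'
print(record.FORMAT)                   # per-sample keys, e.g. GT:DP:GQ
print(type(record.INFO.get('ANN')))    # usually a list of strings; None if absent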

2. Apply the correct method with proper arguments

Use the corrected code shown above. Pay special attention to the min_qual and min_dp thresholds, and pass them by keyword so they cannot be swapped.

3. Verify the result

Always validate that the output matches expectations before proceeding:

# Verification pattern
filtered = read_and_filter_vcf('input.vcf', min_qual=50, min_dp=20)
assert all(r.QUAL >= 50 for r in filtered), "A record below the QUAL threshold passed"
print(f"Success: {len(filtered)} records kept")

Prevention Tips

  • Filter by the QUAL field for variant call quality (Phred-scaled). Mapping quality is a separate measure, typically reported as MQ in INFO.
  • Filter by sample DP for read depth at the variant site.
  • Filter by GQ for genotype quality when available.
  • Use ANN or CSQ INFO fields for variant annotation (SnpEff/VEP).
  • Filter by allele frequency (AF) to separate rare from common variants.
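The tips above can be combined into a single predicate. The following is a minimal sketch that works on plain values rather than PyVCF objects; the function name keep_variant and the default thresholds are hypothetical. It assumes a missing GQ should be ignored, and that a missing AF fails the rare-variant check, which you may prefer to invert.

```python
def keep_variant(qual, dp, gq=None, af=None,
                 min_qual=30.0, min_dp=10, min_gq=20, max_af=None):
    """Apply QUAL, DP, optional GQ, and optional AF checks to one call."""
    if qual is None or qual < min_qual:
        return False
    if dp is None or dp < min_dp:
        return False
    if gq is not None and gq < min_gq:  # GQ is checked only when available
        return False
    if max_af is not None:
        # Assumption: an unknown AF does not qualify as rare
        if af is None or af >= max_af:
            return False
    return True


print(keep_variant(60, 25, gq=40))                     # True
print(keep_variant(60, 25, gq=10))                     # False
print(keep_variant(60, 25, af=0.001, max_af=0.01))     # True
```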

Common Mistakes

  1. Filtering by QUAL alone without also checking depth or genotype quality
  2. Not filtering multiallelic sites properly (each alt allele needs separate evaluation)

These mistakes appear frequently in real-world bioinformatics code. DodaTech's contributors have identified these patterns through analysis of open-source projects, production systems, and community forums like Stack Overflow.
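For the multiallelic case, one way to evaluate each alt allele separately is to pair every ALT value with its own AF value before applying the threshold. This sketch assumes AF carries one value per ALT allele (declared as Number=A in the header) and works on the raw comma-separated strings; split_multiallelic and rare_alleles are hypothetical names.

```python
def split_multiallelic(alt_field, af_field):
    """Pair each ALT allele with its own AF value; '.' becomes None."""
    alts = alt_field.split(',')
    afs = [None if x == '.' else float(x) for x in af_field.split(',')]
    if len(alts) != len(afs):
        raise ValueError('AF must have one value per ALT allele')
    return list(zip(alts, afs))


def rare_alleles(alt_field, af_field, max_af=0.01):
    """Return only the ALT alleles whose own AF is below max_af."""
    return [alt for alt, af in split_multiallelic(alt_field, af_field)
            if af is not None and af < max_af]


print(rare_alleles('G,T', '0.25,0.002'))  # ['T']
```

Here the site as a whole is common because of G, yet T on its own is rare, which a single per-record check would miss.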

Practice Exercise

Filter a multi-sample VCF to keep only high-quality (QUAL>100, DP>20), rare (AF<0.01) missense variants in coding regions.

This exercise reinforces the concepts covered in this guide. Try implementing it before checking online solutions. This hands-on approach ensures you retain the knowledge and can apply it independently.

FAQ

What is the QUAL field?

Phred-scaled quality score for the variant call: QUAL = -10 * log10(probability that the call is wrong). Q30 = 1/1000 error rate, Q100 = 1e-10 error rate.
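The Phred relationship can be checked with two small helper functions. The names phred_to_error_prob and error_prob_to_phred are hypothetical and used only for this sketch.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred-scaled score."""
    return -10 * math.log10(p)


for q in (30, 50, 100):
    print(q, phred_to_error_prob(q))
```

Each step of 10 in QUAL corresponds to a tenfold drop in error probability, which is why a threshold of 50 is far stricter than 30.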

What is the ANN field?

SnpEff annotation: predicted effect of variant on transcripts, including impact, gene, and feature type.
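To make the ANN subfields easier to work with, each entry can be split into a dictionary keyed by field name. The field order below follows the SnpEff ANN layout as commonly documented, but treat it as an assumption and confirm it against the Description of the ANN line in your own VCF header. The function parse_ann and the sample entry are hypothetical.

```python
# Assumed subfield order of one SnpEff ANN entry
ANN_FIELDS = [
    'Allele', 'Annotation', 'Annotation_Impact', 'Gene_Name', 'Gene_ID',
    'Feature_Type', 'Feature_ID', 'Transcript_BioType', 'Rank',
    'HGVS.c', 'HGVS.p', 'cDNA.pos/cDNA.length', 'CDS.pos/CDS.length',
    'AA.pos/AA.length', 'Distance', 'ERRORS/WARNINGS/INFO',
]


def parse_ann(entry):
    """Split one '|'-separated ANN entry into a dict of named subfields."""
    return dict(zip(ANN_FIELDS, entry.split('|')))


# Made-up entry for illustration
entry = ('T|missense_variant|MODERATE|GENE1|GENE1_ID|transcript|TX1|'
         'protein_coding|3/10|c.100A>T|p.Thr34Ser|100/2000|100/1500|34/500||')
parsed = parse_ann(entry)
print(parsed['Annotation_Impact'], parsed['Gene_Name'], parsed['Feature_Type'])
# MODERATE GENE1 transcript
```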

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro. DodaTech tools integrate seamlessly with Python Data Science workflows for enhanced productivity and security.
