Function Definition
Convert plink files between various assemblies.
This function will take a set of text plink files (ped + map) and apply liftover to it.
Things to consider
- You need a set of text files. Use
plink --bfile binary_file --recode --out text_file
to do so. You need your ped and map files to be named the same. - You need to install liftover and plink either from the official websites here and here or through conda (with the bioconda channel): use
conda install ucsc-liftover plink
- You need the appropriate chain file:
- The new files are located in the same folder as the old files and with the suffix “liftedover”
How does it work
This script works by:
- Creating a bed file with all the position from the MAP file
- LiftOver the bed file created in step 1 and create a new lifted bed file.
- Generate a new set of plink files. (And remove SNPs that failed to liftover)
- Update the map file generated in step 3 with the position from the bed files generated in step 2
- Remove temporary files if asked
import pandas as pd
from subprocess import run
from os import remove
def liftover_plink(MAP_FILE, CHAIN_FILE, remove_tmp = True):
# READ OLD MAP FILE
old_map_df = pd.read_csv(MAP_FILE, sep="\t", header=None)
#CREATE THE NEW BED FILE TO BE LIFTED OVER
new_bed_file = old_map_df[[0, 3, 3, 1]]
new_bed_file.columns = ["#chr", "start", "stop", "rsid"]
new_bed_file["stop"] = new_bed_file["stop"]+1 # NEED A RANGE OR IT DOESNT WORK
number_of_initial_probes = new_bed_file.shape[0]
# CHECK CHROMOSOME FORMAT
# UCSC NEEDS THE chr PREFIX
if isinstance(new_bed_file["#chr"].loc[0], str): # IF THE NAME IS A STRING
if "chr" in new_bed_file["#chr"].loc[0]:
pass
else:
new_bed_file["#chr"] = "chr" + new_bed_file["#chr"].astype(str)
print("Added chr to chromosome/seq/contig names")
else:
new_bed_file["#chr"] = "chr" + new_bed_file["#chr"].astype(str)
print("Added chr to chromosome/seq/contig names")
new_bed_file.to_csv("tmp_before_bed.bed", index = None, sep = "\t")
# RUN LIFTOVER FIRST
run(["liftover", "tmp_before_bed.bed", CHAIN_FILE, "tmp_after_bed.bed", "unmapped.txt"], capture_output=False)
# GET THE FAILED ONES
failed_positions = []
with open("unmapped.txt","r") as infile:
with open("failed_probes.txt", "w") as outfile:
for line in infile.readlines():
line = line.strip()
if not line.startswith("#"):
failed_positions.append(line.split("\t")[-1])
outfile.write(line.split("\t")[-1]+"\n")
print("{} out of {} ({:.2f}%) position failed to liftover".format(len(failed_positions), number_of_initial_probes, (len(failed_positions)/ number_of_initial_probes) * 100))
# Remove Failed probes using plink
print("Removing failed probes with plink an generating new files")
run(["plink", "--file", MAP_FILE.replace(".map", ""), "--exclude", "failed_probes.txt" ,"--recode", "--out", MAP_FILE.replace(".map", ".liftedover")], capture_output=True)
# READ LIFTED BED
lifted_bed_df = pd.read_csv("tmp_after_bed.bed", sep="\t", header = None)
lifted_dict = dict(zip(lifted_bed_df[3], lifted_bed_df[1]))
# UPDATE FINAL MAP FILE
print("Updating MAP file")
final_map_df = pd.read_csv(MAP_FILE.replace(".map", ".liftedover.map"), sep="\t", header = None)
final_map_df[3] = final_map_df[1].map(lifted_dict)
final_map_df.to_csv(MAP_FILE.replace(".map", ".liftedover.map"), sep="\t", header = None, index = None)
# REMOVING TMP FILES
if remove_tmp:
print("Removing temporary files")
remove("tmp_before_bed.bed")
remove("tmp_after_bed.bed")
remove("unmapped.txt")
remove("failed_probes.txt")
Example Usage
Use these as examples:
liftover_plink(CHAIN_FILE = "hg18ToHg19.over.chain.gz", # The chain file between assemblies
MAP_FILE = "example_file.hg18.map", # the map file WITH MAP EXTENSION
remove_tmp = True) # Removing or not the temporary files