Function Definition

Convert plink files between various assemblies.
This function will take a set of text plink files (ped + map) and apply liftover to it.

Things to consider
  • You need a set of text files. Use plink --bfile binary_file --recode --out text_file to do so. You need your ped and map files to be named the same.
  • You need to install liftover and plink either from the official websites here and here or through conda (with the bioconda channel): use conda install ucsc-liftover plink
  • You need the appropriate chain file:
  • The new files are located in the same folder as the old files and with the suffix “liftedover
How does it work

This script works by:

  1. Creating a bed file with all the position from the MAP file
  2. LiftOver the bed file created in step 1 and create a new lifted bed file.
  3. Generate a new set of plink files. (And remove SNPs that failed to liftover)
  4. Update the map file generated in step 3 with the position from the bed files generated in step 2
  5. Remove temporary files if asked
import pandas as pd 
from subprocess import run
from os import remove

def liftover_plink(MAP_FILE, CHAIN_FILE, remove_tmp = True):
    # READ OLD MAP FILE
    old_map_df = pd.read_csv(MAP_FILE, sep="\t", header=None)

    #CREATE THE NEW BED FILE TO BE LIFTED OVER
    new_bed_file = old_map_df[[0, 3, 3, 1]]
    new_bed_file.columns = ["#chr", "start", "stop", "rsid"]
    new_bed_file["stop"] = new_bed_file["stop"]+1 # NEED A RANGE OR IT DOESNT WORK

    number_of_initial_probes = new_bed_file.shape[0]

    # CHECK CHROMOSOME FORMAT 
    # UCSC NEEDS THE chr PREFIX
    if isinstance(new_bed_file["#chr"].loc[0], str): # IF THE NAME IS A STRING
        if "chr" in new_bed_file["#chr"].loc[0]:
            pass
        else:
            new_bed_file["#chr"] = "chr" + new_bed_file["#chr"].astype(str)
            print("Added chr to chromosome/seq/contig names")
    else: 
        new_bed_file["#chr"] = "chr" + new_bed_file["#chr"].astype(str)
        print("Added chr to chromosome/seq/contig names")

    new_bed_file.to_csv("tmp_before_bed.bed", index = None, sep = "\t")

    # RUN LIFTOVER FIRST
    run(["liftover", "tmp_before_bed.bed", CHAIN_FILE, "tmp_after_bed.bed", "unmapped.txt"], capture_output=False)

    # GET THE FAILED ONES 
    failed_positions = []
    with open("unmapped.txt","r") as infile:
        with open("failed_probes.txt", "w") as outfile:
            for line in infile.readlines():
                line = line.strip()
                if not line.startswith("#"):
                    failed_positions.append(line.split("\t")[-1])
                    outfile.write(line.split("\t")[-1]+"\n")
    print("{} out of {} ({:.2f}%) position failed to liftover".format(len(failed_positions), number_of_initial_probes, (len(failed_positions)/ number_of_initial_probes) * 100))

    # Remove Failed probes using plink
    print("Removing failed probes with plink an generating new files")
    run(["plink", "--file", MAP_FILE.replace(".map", ""), "--exclude", "failed_probes.txt" ,"--recode", "--out", MAP_FILE.replace(".map", ".liftedover")], capture_output=True)

    # READ LIFTED BED 
    lifted_bed_df = pd.read_csv("tmp_after_bed.bed", sep="\t", header = None)
    lifted_dict = dict(zip(lifted_bed_df[3], lifted_bed_df[1]))

    # UPDATE FINAL MAP FILE 
    print("Updating MAP file")
    final_map_df = pd.read_csv(MAP_FILE.replace(".map", ".liftedover.map"), sep="\t", header = None)
    final_map_df[3] = final_map_df[1].map(lifted_dict) 
    final_map_df.to_csv(MAP_FILE.replace(".map", ".liftedover.map"), sep="\t", header = None, index = None)

    # REMOVING TMP FILES
    if remove_tmp:
        print("Removing temporary files")
        remove("tmp_before_bed.bed")
        remove("tmp_after_bed.bed")
        remove("unmapped.txt")
        remove("failed_probes.txt")

Example Usage

Use these as examples:

liftover_plink(CHAIN_FILE = "hg18ToHg19.over.chain.gz", # The chain file between assemblies
               MAP_FILE = "example_file.hg18.map", # the map file WITH MAP EXTENSION
               remove_tmp = True) # Removing or not the temporary files