
Finding hidden hyperlinks inside a PDF

In a few days I will start a new job. I can't disclose the name of the employer for privacy reasons, but suffice to say it is a large corporate in Australia. Today I received a welcome email and a guide to my first day and the initial training I will undertake. In the welcome email I am told there are links embedded (and hidden), and the person who finds the most links wins a prize! Not wanting to miss out on the opportunity to win a little prize and impress my new employer, I decided to write a Python script to analyse and extract all hyperlinks from a PDF! May as well use the skills I have learnt :)

I'm going to use my IDE of choice, PyCharm Community Edition.

After a little while my script is done.

The script utilises the PyPDF4 package, which I had quite a few issues with to begin with, but I got there in the end. After importing the packages, the script defines the pattern to search for, in this case "https?://\S+". It then iterates through all the PDF's pages and extracts the text from each one. Next it finds all the strings that match the pattern (the URLs) and prints them to the console.
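For anyone who wants to follow along, here is a minimal sketch of the steps described above. It is not my exact script: the function names find_urls and extract_urls_from_pdf are made up for illustration, and it assumes PyPDF4 exposes the PdfFileReader, getNumPages, getPage and extractText names it inherited from the older PyPDF2 API.

```python
import re

# The pattern quoted in the post: "http" or "https", then "://",
# then a run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def find_urls(text):
    """Return every substring of text that matches URL_PATTERN, in order."""
    return URL_PATTERN.findall(text)


def extract_urls_from_pdf(path):
    """Collect URLs from the extracted text of every page of the PDF at path."""
    # Imported here so find_urls can still be used without PyPDF4 installed.
    # Assumption: PyPDF4 provides PdfFileReader with the PyPDF2-style methods.
    from PyPDF4 import PdfFileReader

    urls = []
    with open(path, "rb") as pdf_file:
        reader = PdfFileReader(pdf_file)
        for page_number in range(reader.getNumPages()):
            page_text = reader.getPage(page_number).extractText()
            urls.extend(find_urls(page_text))
    return urls
```

Calling extract_urls_from_pdf("file.pdf") and printing each item of the result would reproduce the console output described above. One thing to keep in mind is that \S+ is greedy, so a URL followed directly by a full stop or closing bracket will include that character in the match.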


To protect the privacy of the company I am not displaying the URLs, but you can test the script yourself by running it in an IDE like PyCharm, making sure you place a PDF file named file.pdf inside the project folder. Run the script and watch the magic!


After about an hour of tinkering I decided to come back to the script and add a simple GUI that lets the user choose their PDF from any location on the local disk. The script then extracts the links, even if they are 'hidden', and displays the total number of links on the screen!
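A GUI like this can be sketched with tkinter from the standard library. This is a rough outline rather than my actual code: summarise_links and choose_pdf_and_report are hypothetical names, and extract_links stands for whatever function you have that takes a file path and returns a list of URLs.

```python
def summarise_links(urls):
    """Build the message shown to the user: the total, then one URL per line."""
    lines = ["Total links found: {}".format(len(urls))]
    lines.extend(urls)
    return "\n".join(lines)


def choose_pdf_and_report(extract_links):
    """Ask the user for a PDF, run extract_links(path) on it, show the result."""
    # Imported here so summarise_links also works on machines without a display.
    import tkinter as tk
    from tkinter import filedialog, messagebox

    root = tk.Tk()
    root.withdraw()  # hide the empty main window; only the dialogs are shown
    path = filedialog.askopenfilename(
        title="Choose a PDF",
        filetypes=[("PDF files", "*.pdf")],
    )
    if path:  # an empty value means the user cancelled the dialog
        messagebox.showinfo("Links found", summarise_links(extract_links(path)))
    root.destroy()
```

Passing the extraction function in as an argument keeps the window code separate from the PDF code, so either half can be swapped out or tested on its own.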


I reckon I could be in with a chance of winning the competition on day 1 of my new job!


