Down the Rabbit Hole of Expired Domains

I recently carried out some experiments with finding expired domains that have perceived SEO value. This is a write-up to semi-document what I did and to try to justify to my business partner why I haven’t done any real work for the last couple of weeks.

Red Rag to a Bull

Showing an inefficient process to a programmer is like waving a red rag in front of a bull (or leaving out bubble wrap where people can pop it). Recently I have kept bumping into people looking for expired domains that have existing backlinks and various SEO metrics.

The process most people seemed to follow was checking a site for broken links, the logic being that a broken link could point to an expired domain, and the fact that one person linked to it means others probably did as well.

This struck me as a long-winded process and thus my brain was engaged on a loop that has led to a few late nights and more than a few high caffeine beverages.

Seeing How Many I Could Find

I know a little about how the internet works and was aware of the existence of a magical entity known as the “Zone File”. The zone file keeps a list of every DNS record, which is basically a pointer that connects a domain name to a web server’s IP address. When a DNS record disappears from the zone file, there is a good chance this is because the domain is no longer registered.

“Simple!” I thought. I’ll just compare an old zone file to a current zone file and see what is missing from the new file. That will give me a list of all the domain goodies in one fell swoop while everyone else is still crawling their first site.

After a quick Google I had located the most recent zone file and also a zone file from 5 years ago. This would give me a list of all domains registered over 5 years ago that have expired within the last 5 years.

The unprocessed zone files were around 10GB in size which proved interesting to work with. I had to do some Googling and stare at an animated gif on the Wikipedia page for external merge sort algorithms longer than I would care to admit.
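
For anyone else staring at that animation, here is a minimal sketch of an external merge sort, assuming one domain per line. The function name `external_sort` and the `chunk_lines` parameter are made up for illustration and this is not necessarily how the files were handled here: it sorts chunks that fit in memory, writes each to a temporary file, then merges them lazily.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(in_path, out_path, chunk_lines=1_000_000):
    """Sort a large text file line by line without loading it all into memory."""
    chunk_paths = []
    with open(in_path, "r", encoding="utf-8") as src:
        while True:
            # Read a chunk small enough to sort in memory.
            chunk = list(islice(src, chunk_lines))
            if not chunk:
                break
            # Make sure the very last line of the file also ends with a newline.
            chunk = [line if line.endswith("\n") else line + "\n" for line in chunk]
            chunk.sort()
            fd, path = tempfile.mkstemp(suffix=".chunk", text=True)
            with os.fdopen(fd, "w", encoding="utf-8") as tmp:
                tmp.writelines(chunk)
            chunk_paths.append(path)

    # Merge the sorted chunks; heapq.merge pulls lines from each file lazily.
    files = [open(p, "r", encoding="utf-8") for p in chunk_paths]
    try:
        with open(out_path, "w", encoding="utf-8") as dst:
            dst.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)
```

With a very large input and a small `chunk_lines` the number of temporary files can exceed the operating system's open-file limit, so the chunk size wants to be as large as memory comfortably allows.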

After a little tidying up, the input files were under 2GB and I could do some programming wizardry to compare both files line by line and establish which domains were good to go.
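
The line-by-line comparison can be sketched roughly as below, assuming both lists are already sorted in the same order and contain one domain per entry. `dropped_domains` is a hypothetical name; the idea is to walk both lists in step so that neither has to be held in memory.

```python
def dropped_domains(old_sorted, new_sorted):
    """Yield entries present in old_sorted but missing from new_sorted.

    Both inputs must be iterables of strings in the same ascending order.
    """
    new_iter = iter(new_sorted)
    current_new = next(new_iter, None)
    for old in old_sorted:
        # Advance the new list until it catches up with the old entry.
        while current_new is not None and current_new < old:
            current_new = next(new_iter, None)
        # If the new list skipped past this entry, the domain has gone.
        if current_new != old:
            yield old
```

To run it over files, pass in generators that strip the newline from each line, for example `(line.strip() for line in old_file)`.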

Domains for You, and You, and You….

Final result: 44 million dropped domains (cue mic drop, hand slapping gesture and walk-away).

A lot of these domains were pure spam and some were still registered; they had just deleted their DNS record for some other mysterious reason. I needed to clean the list up a bit and make it useful.

In a naïve fashion I randomised the list, took a segment of about 2 million records and then ran it through some APIs to get the raw SEO stats.

That turned up about 200,000 that had some sort of DA (Domain Authority) or TF (Trust Flow) value. I ordered it by TF or DA (whichever flavour you prefer) and sent out a few lists for the good of the people and rapidly patted myself on the back. I figured I could now turn my attention back to my actual work.

Thanks But No Thanks!

It quickly became clear that sharing these domains with people had induced a level of keyboard rage I had not anticipated.

In my haste I had failed to realise that your average SEO dude probably doesn’t want to register HUGEGAYBONERS.COM despite its high DA value.

Now at this point, your average healthy, sane individual would admit their mistake, slink away with their tail between their legs and get on with client work and other things that pay the bills. So of course I doubled down and decided to jump deeper into the mysterious world of expired domains.

Getting Lost in The Rabbit Hole

I had to figure out a way to filter the list and at least make a more family friendly version minus all the mucky domains.

I experimented with creating short lists based on keywords contained inside the domain. For instance, I extracted 120,000 domains with the keyword “blog” in them. I could also create lists based on negative keywords.
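
As a rough sketch of that kind of short-listing, assuming a plain substring match on the domain name (the function name and the example keywords are hypothetical):

```python
def filter_by_keywords(domains, include=(), exclude=()):
    """Keep domains containing any include keyword and none of the exclude keywords."""
    include = [k.lower() for k in include]
    exclude = [k.lower() for k in exclude]
    kept = []
    for domain in domains:
        name = domain.lower()
        # With no include keywords, every domain is a candidate.
        if include and not any(k in name for k in include):
            continue
        # A single negative keyword is enough to drop the domain.
        if any(k in name for k in exclude):
            continue
        kept.append(domain)
    return kept
```

For example, `filter_by_keywords(domains, include=["blog"], exclude=["casino"])` keeps only the "blog" domains that do not also mention "casino".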

The results from this were pretty poor and still very spammy. I noticed a lot of domains were just pure gibberish or in a different language. One thing that struck me though was that genuine sites had permalink structures like “/this-is-my-article-title/”. I figured I would have a play with extracting keyword data.

Luckily a history of most sites is stored at the Wayback Machine so I requested a full listing of all pages for each domain and proceeded to extract words that might be useful in categorising the domain.
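
One way such a listing could be requested is through the Wayback Machine's public CDX endpoint. The sketch below assumes that endpoint and its `matchType`, `fl` and `collapse` parameters; the function names are made up and this is not necessarily the request used here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain, limit=None):
    """Build a CDX query listing the distinct archived URLs for a domain."""
    params = {
        "url": domain,
        "matchType": "domain",   # include subdomains as well
        "output": "json",
        "fl": "original",        # only return the original URL field
        "collapse": "urlkey",    # one row per distinct URL
    }
    if limit is not None:
        params["limit"] = str(limit)
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_json(body):
    """Turn a CDX JSON response into a flat list of URLs."""
    rows = json.loads(body) if body.strip() else []
    # With output=json the first row is a header naming the fields.
    return [row[0] for row in rows[1:]]


def list_archived_urls(domain, limit=None):
    """Fetch and parse the listing (performs a network request)."""
    with urlopen(build_cdx_url(domain, limit), timeout=60) as resp:
        return parse_cdx_json(resp.read().decode("utf-8"))
```

Splitting the URL building and the parsing from the network call keeps the fiddly parts checkable without hammering the archive.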

Anyone Got a Spare Supercomputer I Could Borrow?

Initially I downloaded a dictionary of every English word. I then ran all string combinations for every page URL saved on Wayback Machine, fitting a minimum and maximum word size, through the dictionary to see which ones were real English words. I also counted the number of occurrences of each word so I could create a keyword density as a metric later to order domains by how targeted they were to a particular niche.
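
A minimal sketch of that brute-force pass, assuming the dictionary is a set of lowercase words and that "keyword density" means each word's share of all matches (both are assumptions, as are the function names):

```python
import re
from collections import Counter


def naive_word_counts(urls, dictionary, min_len=3, max_len=15):
    """Count every dictionary word found as a substring of the URL text."""
    counts = Counter()
    for url in urls:
        # Split the URL into runs of letters, e.g. "thisismyarticle".
        for token in re.findall(r"[a-z]+", url.lower()):
            for start in range(len(token)):
                for size in range(min_len, max_len + 1):
                    piece = token[start:start + size]
                    if len(piece) < size:
                        break  # ran off the end of the token
                    if piece in dictionary:
                        counts[piece] += 1
    return counts


def keyword_density(counts):
    """Express each word's count as a fraction of all matches."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}
```

Because every substring is tested, a single long word also matches all of the shorter dictionary words hiding inside it.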

This worked faster than I expected but my PC fans kicked in and my room warmed up a little while it ran. My mistake though was running every combination. This meant instead of getting the word “disappointment” I would get “men”, “point”, “appoint”, etc. You get the picture.

Instead I had to rewrite this to first identify the largest possible word and then exclude that part of the string from further word matches. I was getting tired at this point and made a lazily coded nested loop to do this for me.
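
The longest-word-first idea might look something like this. It is a sketch under the assumption that the dictionary is a set of lowercase words, with a hypothetical function name, and it is certainly not the original nested loop: it finds the longest matching word, removes it, and repeats on whatever is left on either side.

```python
def longest_words(token, dictionary, min_len=3):
    """Pull out the longest dictionary word, then repeat on the leftover pieces."""
    # Try the longest substrings first so "disappointment" beats "point".
    for size in range(len(token), min_len - 1, -1):
        for start in range(len(token) - size + 1):
            piece = token[start:start + size]
            if piece in dictionary:
                # The matched part is excluded; only the remainders are searched.
                left = longest_words(token[:start], dictionary, min_len)
                right = longest_words(token[start + size:], dictionary, min_len)
                return left + [piece] + right
    return []
```

This still tests a lot of substrings per token, which goes some way to explaining why a run over tens of millions of domains is not quick.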

After hitting “Run” on my project my PC truly screamed for mercy and the speed was, shall we say, not good? A rough, inaccurate calculation done with a stopwatch told me the process would probably finish the 44 million in about 200 days. Great. So, I just need to have my CPU maxed for 200 days, then all will be good.

In true narrow-sighted splendour, instead of walking away at this point, I decided to double down and optimise my code. I stayed up late and after giving this my full focus I had the totally inaccurate estimated running time down to only 60 days. Yay?
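
For the curious, the stopwatch arithmetic is simple enough to write down. The helper names below are made up; the only figures taken from above are the 44 million domains and the 200 days, which together imply roughly two and a half domains processed per second.

```python
SECONDS_PER_DAY = 86_400


def estimated_days(total_items, sample_items, sample_seconds):
    """Extrapolate total run time in days from a timed sample."""
    rate = sample_items / sample_seconds      # items per second
    return total_items / rate / SECONDS_PER_DAY


def implied_rate(total_items, days):
    """Items per second needed to finish total_items in the given days."""
    return total_items / (days * SECONDS_PER_DAY)
```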

Let’s Agree That No-one Agrees

At this point, more “constructive criticism” had filtered through to me. I figured initially people would be more concerned with the niche of the domain and previously hosted site. Unbeknownst to me, there was a whole ecosystem of expired domain sellers in existence who had set the laws for what makes a good domain.

I made a couple of posts in online forums and discovered that opinions are many but hard data is all but absent.

The general “wisdom” was that raw stats weren’t worth anything and, like second-hand cars, it was more about the history and what it had been used for in the past that counted.

That does make sense but some of the conspiracy level theories on the forums are a bit extreme. I have helped build a PBN recently and many of the domains had several previous owners in various industries and they indexed immediately without any issue.

A Quick Reality Check

I decided to stop aimlessly coding and talk to a few people who sold expired domains. This was now getting beyond a few hours’ diversion and becoming a full-blown project, something my business partner had warned me against right at the start (they seem to know me better than I do).

It seemed that the mecca for a domain seller is a domain with a reasonable DA/TF and decent referring domain count without any previous history of SEO guys using the domain. So just a genuine looking site with no dodgy Chinese looking pages or previous PBN usage.

Most of the guys I talked to seemed to be using a piece of software called “Domain Hunter Gatherer” that scraped lists from somewhere and then had a few filters. This gave me an “A-Ha” moment. (Yes, I totally had a moment. I didn’t see the other software’s filter functions and decide to copy them. That isn’t what happened at all… OK, maybe a little, but I decided to do a bit of expansion.)

A “Simple” Plan

I decided to have a staged filter process. I would run tests that completed very quickly first. If those tests failed then subsequent tests would be skipped. This would speed things up. The main mental shift was that I was now trying to filter spam as quickly as possible so I would reduce the size of the data set. I could then run the more hardcore CPU intensive keyword processes on a much lower number of “clean” domains.
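
The fail-fast idea can be sketched in a few lines. Everything here is hypothetical naming: each stage is a function that returns `None` when the domain passes or a reason string when it fails, and the stages are listed cheapest first.

```python
def run_stages(domain, stages):
    """Run checks in order and stop at the first failure.

    stages is a list of (name, check) pairs, cheapest first. Each check
    takes a domain and returns None to pass or a reason string to fail.
    """
    for name, check in stages:
        reason = check(domain)
        if reason is not None:
            # Later, more expensive stages are never run for this domain.
            return {"domain": domain, "passed": False,
                    "failed_stage": name, "reason": reason}
    return {"domain": domain, "passed": True,
            "failed_stage": None, "reason": None}
```

Recording the stage and reason for each failure also makes it possible to say afterwards why a particular domain was thrown out.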

After a little research into what people considered spam and poking around dodgy sites to see what tricks they used I had a simple fool-proof process that I will show you in a boring list format because I am too lazy to make a diagram:

Stage 1) Request a list of all indexed pages at the Wayback Machine. Check each URL to see if it contains any word in my URL filters. I made separate ones for porn, gambling, pharma etc. so I could classify in the database why I was failing the domain (in case people flamed me for excluding their niche or if I ever decide to do an in-depth keyword study for midget porn).
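
A minimal sketch of a per-category URL filter like the one in Stage 1. The category names and the tiny word lists are placeholders for illustration, not the real filters:

```python
# Placeholder filter lists; real ones would be far longer.
URL_FILTERS = {
    "gambling": ["casino", "poker"],
    "pharma": ["viagra", "pills"],
}


def classify_urls(urls, filters=URL_FILTERS):
    """Return {category: set of matched words} for every filter word found in a URL."""
    hits = {}
    for url in urls:
        lowered = url.lower()
        for category, words in filters.items():
            for word in words:
                if word in lowered:
                    hits.setdefault(category, set()).add(word)
    return hits
```

An empty result means the domain passes this stage; anything else records which category tripped it.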

Stage 2) I noticed most nefarious websites would block the Wayback Machine from indexing the site with their robots.txt. In fact most people selling the domains seemed to be selling not actual “clean” domains but ones that appeared to be clean because they didn’t have any obvious signs in the Wayback Machine. I of course was not satisfied with this so I decided to eliminate any website that at any point in its history had ever blocked a particular user agent. I did this by requesting all versions of robots.txt from Wayback and parsing them accordingly (even if you block Wayback, they will still keep a record that you blocked them. Clever Wayback Machine!)
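
Here is a simplified sketch of the robots.txt check in Stage 2. It assumes that "blocked" means a `Disallow: /` rule for the user agent in question, and that the agent of interest is `ia_archiver` (the name historically associated with the Internet Archive's crawler); both are assumptions, as are the function names. It ignores wildcards and `Allow` overrides.

```python
def blocked_agents(robots_txt):
    """Return the set of user agents that are disallowed from the whole site."""
    blocked = set()
    group = []          # user agents of the group currently being read
    in_rules = False    # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # a new group starts here
                group, in_rules = [], False
            group.append(value.lower())
        elif field in ("disallow", "allow"):
            in_rules = True
            if field == "disallow" and value == "/":
                blocked.update(group)
    return blocked


def ever_blocked(robots_versions, agents=("ia_archiver",)):
    """True if any stored robots.txt version blocked one of the given agents."""
    wanted = {a.lower() for a in agents}
    return any(blocked_agents(text) & wanted for text in robots_versions)
```

Passing `agents=("ia_archiver", "*")` would also catch sites that blocked every crawler at some point.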

Stage 3) Check every stored version of the homepage against my various categorised “bad word” lists and store which ones it found (you know, for weirdos who are interested in that kind of thing). I also plugged in a language detection library and recorded the detected language of every stored version of the homepage.
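
Stage 3 could be sketched along these lines. The snapshot format, the function name and the word lists are all hypothetical, and the language detector is passed in as a plain function so that any library can be plugged in (for instance `detect` from the `langdetect` package, which is only a guess at what might be used).

```python
import re


def scan_snapshots(snapshots, bad_words, detect_language):
    """Check each stored homepage version for bad words and record its language.

    snapshots: iterable of (timestamp, html) pairs.
    bad_words: {category: list of lowercase words}.
    detect_language: function taking text and returning a language code.
    """
    report = []
    for timestamp, html in snapshots:
        # Crude tag stripping; a real HTML parser would be more robust.
        text = re.sub(r"<[^>]+>", " ", html).lower()
        words = set(re.findall(r"[a-z]+", text))
        found = {}
        for category, terms in bad_words.items():
            matches = sorted(words.intersection(terms))
            if matches:
                found[category] = matches
        report.append({
            "timestamp": timestamp,
            "bad_words": found,
            "language": detect_language(text),
        })
    return report
```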

Ta-da! Told You It Was Easy

So, after all of that palaver, what do I have? Some pretty good data, that’s what!

Not everything is entirely automated yet; there are still a few stages I would like to code:

Stage 4) Get anchor text from Majestic API and run it against bad keyword lists

Stage 5) Lookup domain availability automatically (doing this manually at the moment with an online tool)

Stage 6) Run my funky keyword analysis to keep me warm through winter and also allow me to pick out non-obvious niche domains

Putting My Head Above Ground

It’s really time to get back to work and I feel like I have done enough here to prove a point. If I get positive feedback this time after sharing the data then I may carry on working on it.

The fact that no one really agrees on what they want makes it kind of tough to condense everything down to a small list. At first, we will probably experiment with a few different small lists based on different metrics to gauge people’s reactions but ultimately I should probably put the entire database online and let people use my code to check domains for a hidden spam history.

I am not sure. I will wait and see if anyone says anything positive and if I can get the time.

If you want the domain lists, click here and sign up. It’s free! The list goes out every Friday.

Ian says October 5, 2017

A most excellent write up and your partners should definitely cut you some slack. You’re on to something there you clever, clever chap.