Simple approach to website keyword extraction

Aug 1, 2022 by Roman Landenband

Keywords are central to the work of marketers and SEO experts. They can be used to model the type of content users are looking for on the web, to target those users with ad campaigns, or to improve a web page's ranking in "organic" traffic (its position in the search results for a given search input).

Historically, websites used the "keywords" meta HTML tag to list the keywords they wanted search engines to see. Some websites abused this to game the system and rank for keywords that had nothing to do with their content. As a result, Google and other search engines stopped looking there, and eventually websites stopped using the "keywords" meta-tag, since by that point competitors were the only ones reading it.

We hope that this article helps you extract keywords from various websites. This can help you improve your own website's ranking or analyze your competitors' sites.

Why are keywords important?

Keywords are important because they help you reach the right audience, increase your conversion rate and brand recognition. But keywords aren't just used for SEO; they're also used in content, social media and CRO (conversion rate optimization).

What is keyword extraction?

Keyword extraction is the process of pulling keywords out of a website's content. Keywords are often found in the title, the meta description and the body text. There are many keyword extraction algorithms, each with its own strengths and weaknesses. Let's look at a simple but potent method that does not use machine learning.

Approach outline

Input

If you want to find relevant keywords for a website, you need to think like its creators—but in reverse!

It is important to look at what the creators of the website want you to see.

We will be using content from the H1 tag, the title tag and the description meta-tag.

The H1 HTML tag on a webpage is often used as the main call-to-action: the text the site's creators want people to read first on the page.

We look at the title of the homepage. It will usually be a short description of what the business does and may contain some keywords worth paying attention to.

When a webpage is shared on social networks, or anywhere else that requires a description of its contents, the content of the description meta-tag is shown. This description provides more information about what's inside the page.
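As a rough sketch of how these three inputs could be collected, the snippet below uses only Python's standard-library `html.parser`. The class name `PageInputParser` and the function `extract_inputs` are made up for this illustration, and the code assumes the page's HTML has already been downloaded into a string. It only keeps the first H1 and the first title it meets.

```python
from html.parser import HTMLParser


class PageInputParser(HTMLParser):
    """Collects the first <h1>, the first <title> and the description meta-tag."""

    def __init__(self):
        super().__init__()
        self.fields = {"h1": "", "title": "", "description": ""}
        self._open = None   # "h1" or "title" while we are inside that tag
        self._seen = set()  # tags already collected; only the first of each counts

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "title") and self._open is None and tag not in self._seen:
            self._open = tag
        elif tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() == "description":
                self.fields["description"] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == self._open:
            self._seen.add(tag)
            self._open = None

    def handle_data(self, data):
        # Text inside nested tags (e.g. <em> inside <h1>) is collected as well.
        if self._open is not None:
            self.fields[self._open] += data


def extract_inputs(html):
    """Return the H1, title and description text with whitespace normalised."""
    parser = PageInputParser()
    parser.feed(html)
    parser.close()
    return {name: " ".join(text.split()) for name, text in parser.fields.items()}


page = """<html><head><title>Acme Pizza | Fast delivery in Berlin</title>
<meta name="description" content="Order fresh pizza online."></head>
<body><h1>Fresh pizza, <em>delivered</em> fast</h1></body></html>"""
inputs = extract_inputs(page)
print(inputs)
```

Real pages are messier than this sample (missing tags, several H1s, descriptions in other meta properties), so treat this as a starting point rather than a complete extractor.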

Processing

We've got some data to work with, but not much. Our input (the H1 tag content, the title, and the description meta-tag) is concise, which also means it is not much data to go on.

We want to find a simple approach, so we will not be using NLP (Natural Language Processing) or any classical "machine-learning" approaches.

Our main strategy here is to clean the input as well as we can, in the hope that what remains is a handful of high-quality keywords.

In practice, this means removing stop words (common words like "is", "at", "that", etc.) and splitting the input wherever they appear.
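To make the splitting step concrete, here is a minimal sketch. The stop-word list is a tiny made-up sample (real lists contain a hundred or more words), the function name `split_on_stop_words` is hypothetical, and treating punctuation as a split point too is our own assumption, not something the approach strictly requires.

```python
import re

# Tiny illustrative stop-word list (an assumption; use a fuller list in practice).
STOP_WORDS = {"a", "an", "and", "at", "for", "in", "is", "of", "on",
              "that", "the", "to", "with", "your"}


def split_on_stop_words(text, stop_words=STOP_WORDS):
    """Split text into candidate phrases at every stop word or punctuation mark."""
    # Tokens are either words (optionally joined by ' or -) or single punctuation marks.
    tokens = re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", text.lower())
    phrases, current = [], []
    for token in tokens:
        is_boundary = token in stop_words or not token[0].isalnum()
        if is_boundary:
            # A stop word or punctuation mark ends the current phrase and is dropped.
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        phrases.append(" ".join(current))
    return phrases


print(split_on_stop_words("Fresh pizza, delivered fast at your door in Berlin"))
```

With the sample list above, the stop words "at", "your" and "in" and the comma all act as cut points, so the sentence falls apart into short candidate phrases.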

Implementation outline

extract the H1, title and description meta values
split the strings at "stop words" (removing the stop words themselves)
remove any remaining phrases with more than 2 or 3 words in them (by counting spaces, for example)
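The last step of the outline can be sketched in a few lines. The function name `keep_short_phrases` and the default limit of 3 words are assumptions made for this example, and dropping duplicates is an extra we added because the H1, title and description often repeat each other.

```python
def keep_short_phrases(phrases, max_words=3):
    """Keep phrases of at most max_words words, preserving their original order."""
    kept = []
    for phrase in phrases:
        # Normalise whitespace first so that counting spaces is reliable.
        phrase = " ".join(phrase.split())
        if not phrase:
            continue
        word_count = phrase.count(" ") + 1
        # Duplicate removal is our own addition, not part of the outline.
        if word_count <= max_words and phrase not in kept:
            kept.append(phrase)
    return kept


candidates = ["fresh pizza", "order online today now", "berlin", "fresh pizza", "  "]
keywords = keep_short_phrases(candidates)
print(keywords)
```

Lowering `max_words` to 2 gives a stricter filter; which limit works better likely depends on the kind of sites you are analyzing.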

That's about it! Short and sweet.

This is how we currently extract keywords at CueTap, and it seems to work pretty well so far.