Automate SEO factors testing using Behat - Part 2

In our previous blog post, Automate SEO factors testing using Behat - Part 1, we covered three SEO factors: meta tags, keyword optimization, and image optimization. Here we cover the remaining SEO factors.

Redirection

A redirect forwards users and search engines from an old URL to the correct URL.

Types of Redirection:

301 Moved Permanently is a permanent redirect and is the best choice for preserving SEO ranking.
302 Found (formerly Moved Temporarily) is a temporary redirect which indicates that the requested resource has been temporarily moved to a new URL.
307 Temporary Redirect is also a temporary redirect, similar to 302; the only difference is that the HTTP method remains the same when the request is repeated against the new URL.

Sample Scenario:

Given I am on "/redirect.php"

Then the response status code should be 301

And I should be redirected to "/redirect/redirect.php"

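RedirectContext method:

/**
 * Asserts that the "Location" response header points to the expected URL,
 * then follows the redirect.
 */
public function iShouldBeRedirected(string $url): void
{
    // Header names are case-insensitive, so normalize the keys to lower case.
    $headers = array_change_key_case($this->getSession()->getResponseHeaders(), CASE_LOWER);

    Assert::keyExists($headers, 'location');
    // Headers may arrive as a list of values; keep the first one. The is_array()
    // guard stops a plain string from being reduced to its first character.
    if (is_array($headers['location']) && isset($headers['location'][0])) {
        $headers['location'] = $headers['location'][0];
    }

    // Accept either an exact match or a match after resolving both URLs.
    Assert::true(
        $headers['location'] === $url || $this->locatePath($url) === $this->locatePath($headers['location']),
        'The "Location" header does not redirect to the correct URI'
    );

    $this->getClient()->followRedirects(true);
    $this->getClient()->followRedirect();
}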
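The comparison performed by this step can also be sketched outside Behat. The helper below is hypothetical: the function name `locationMatches` and the `$baseUrl` parameter are assumptions made for illustration, and the URL resolution is a simplified stand-in for Mink's `locatePath()`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: checks that a response's Location header points at the expected URL.
 * Header names are compared case-insensitively; relative URLs are resolved against $baseUrl.
 */
function locationMatches(array $headers, string $expected, string $baseUrl): bool
{
    $headers = array_change_key_case($headers, CASE_LOWER);
    if (!isset($headers['location'])) {
        return false;
    }

    // Some HTTP clients return each header as a list of values.
    $location = is_array($headers['location'])
        ? (string) ($headers['location'][0] ?? '')
        : (string) $headers['location'];

    // Simplified resolution: prefix relative paths with the base URL.
    $absolute = static fn (string $url): string => preg_match('#^https?://#i', $url) === 1
        ? $url
        : rtrim($baseUrl, '/') . '/' . ltrim($url, '/');

    return $location === $expected || $absolute($location) === $absolute($expected);
}

var_dump(locationMatches(['Location' => ['/redirect/redirect.php']], '/redirect/redirect.php', 'http://localhost'));
```

Accepting both the raw and the resolved form means the scenario can state the target as a relative path even when the server answers with an absolute URL.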

Robots.txt

The robots.txt file is a text file that tells search engine crawlers which pages of the site they may access. In a robots.txt file, we can specify allow or disallow rules for all user-agents or for specific user-agent(s). When the file contains a group of rules that names a specific user-agent, that robot follows the rules specified for it rather than the general ones. To ensure the robots.txt file is found, always place it at the root of the domain. If the robots.txt file is placed in a subdirectory, robots will not discover it and will treat the whole site as open to crawling.

Basic format:

User-agent: [user-agent name]

Disallow: [URL string not to be crawled]

Allow: [URL string to be crawled]

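(Note: Allow was not part of the original robots.txt convention, but it is supported by Googlebot and other major crawlers and is defined in RFC 9309)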

Crawl-delay: [Time in seconds for the crawler to pause crawling the page]

Sitemap: [XML Sitemaps associated with the URL]
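To make the format concrete, here is a minimal sketch of how Allow and Disallow rules could be evaluated for one crawler. It is not the parser used by the Behat extension: the function name `isPathAllowed` is hypothetical, user-agent names are matched exactly, and only plain prefix matching is supported (no "*" or "$" wildcards in paths). The longest matching rule wins, and Allow wins a tie.

```php
<?php
declare(strict_types=1);

/**
 * Simplified sketch: returns true if $path may be crawled by $userAgent.
 */
function isPathAllowed(string $robotsTxt, string $userAgent, string $path): bool
{
    $groups  = [];    // lower-cased agent name => list of [directive, path prefix]
    $agents  = [];    // agents named by the group currently being read
    $inRules = false; // true once the current group has at least one rule

    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim((string) preg_replace('/#.*/', '', $line)); // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if ($inRules) { // a User-agent line after rules starts a new group
                $agents  = [];
                $inRules = false;
            }
            $agent = strtolower($value);
            $agents[] = $agent;
            $groups[$agent] ??= [];
        } elseif ($field === 'allow' || $field === 'disallow') {
            $inRules = true;
            foreach ($agents as $agent) {
                $groups[$agent][] = [$field, $value];
            }
        }
        // Crawl-delay and Sitemap lines are ignored by this sketch.
    }

    // Prefer the group that names this crawler, otherwise fall back to "*".
    $rules = $groups[strtolower($userAgent)] ?? $groups['*'] ?? [];

    $allowed    = true; // no matching rule means the path is crawlable
    $bestLength = -1;
    foreach ($rules as [$directive, $prefix]) {
        if ($prefix === '' || strpos($path, $prefix) !== 0) {
            continue; // an empty value matches nothing
        }
        $length = strlen($prefix);
        if ($length > $bestLength || ($length === $bestLength && $directive === 'allow')) {
            $bestLength = $length;
            $allowed    = ($directive === 'allow');
        }
    }

    return $allowed;
}

$robots = "User-agent: *\nDisallow: /private\nAllow: /private/public\n\n"
    . "User-agent: Googlebot\nDisallow: /crawl-blocked\n";

var_dump(isPathAllowed($robots, 'Googlebot', '/crawl-allowed'));
```

In this example Googlebot has its own group, so the rules under "*" do not apply to it.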

Sample Scenario:

Given I am a "Googlebot" crawler

Then I should be able to crawl "/crawl-allowed"

RobotsContext method:

/**
 * Asserts that the configured crawler is allowed to crawl the resource.
 */
public function iShouldBeAbleToCrawl(string $resource): void
{
    Assert::true(
        $this->getRobotsClient()->userAgent($this->crawlerUserAgent)->isAllowed($resource),
        sprintf(
            'Crawler with User-Agent %s is not allowed to crawl %s',
            $this->crawlerUserAgent,
            $resource
        )
    );
}

/**
 * Builds a robots.txt client for the site under test.
 */
private function getRobotsClient(): UriClient
{
    return new UriClient($this->webUrl);
}

Sitemap.xml

The sitemap is an XML file that provides search engines with the list of URLs available for crawling on a particular website. A sitemap is most useful when a website is new and has few links pointing to it, when it is very large, or when it has a lot of rich media content. In such scenarios, a sitemap tells Google which pages on the website are the most valuable and informative.

For large websites, we may end up with many sitemaps. In that case, we can split a large sitemap into several files and list them in a sitemap index. The following XML tags are used in a sitemap index file:

sitemapindex - The parent tag of the file.
sitemap - The parent tag for each sitemap listed in the file
loc - The location of each sitemap
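As an illustration of how these three tags nest, the sketch below reads a small sitemap index with PHP's DOM extension and lists every sitemap location. The function name `listSitemapLocations` and the example.com URLs are assumptions made for the example; the namespace is the one defined by the sitemaps.org protocol.

```php
<?php
declare(strict_types=1);

/**
 * Returns the <loc> value of every <sitemap> entry in a sitemap index document.
 */
function listSitemapLocations(string $xml): array
{
    $document = new DOMDocument();
    if (!$document->loadXML($xml)) {
        throw new InvalidArgumentException('The sitemap index is not well-formed XML');
    }

    $xpath = new DOMXPath($document);
    // Sitemap elements live in the sitemaps.org namespace, so register a prefix for it.
    $xpath->registerNamespace('sm', 'http://www.sitemaps.org/schemas/sitemap/0.9');

    $locations = [];
    foreach ($xpath->query('/sm:sitemapindex/sm:sitemap/sm:loc') as $loc) {
        $locations[] = trim($loc->textContent);
    }

    return $locations;
}

$index = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/sitemap-pages.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-posts.xml</loc></sitemap>
</sitemapindex>
XML;

print_r(listSitemapLocations($index));
```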

Sample Scenario:

Given the sitemap "/sitemap/valid-sitemap.xml"

Then the sitemap should be valid

SitemapContext method:

/**
 * Validates the loaded sitemap against the XSD schema that matches its type.
 */
public function theSitemapShouldBeValid(string $sitemapType = ''): void
{
    $this->assertSitemapHasBeenRead();

    // Pick the schema for the requested sitemap type.
    switch (trim($sitemapType)) {
        case 'index':
            $sitemapSchemaFile = self::SITEMAP_INDEX_SCHEMA_FILE;
            break;
        case 'multilanguage':
            $sitemapSchemaFile = self::SITEMAP_XHTML_SCHEMA_FILE;
            break;
        default:
            $sitemapSchemaFile = self::SITEMAP_SCHEMA_FILE;
    }

    $this->assertValidSitemap($sitemapSchemaFile);
}

/**
 * Fails early when no sitemap has been loaded by a previous step.
 */
private function assertSitemapHasBeenRead(): void
{
    if (!isset($this->sitemapXml)) {
        throw new InvalidOrderException(
            'You should execute "Given the sitemap :sitemapUrl" step before executing this step.'
        );
    }
}

/**
 * Checks that the schema file exists and that the sitemap validates against it.
 */
private function assertValidSitemap(string $sitemapSchemaFile): void
{
    Assert::fileExists(
        $sitemapSchemaFile,
        sprintf('Sitemap schema file %s does not exist', $sitemapSchemaFile)
    );
    Assert::true(
        // "@" silences libxml warnings so that only the assertion message is reported.
        @$this->sitemapXml->schemaValidate($sitemapSchemaFile),
        sprintf(
            'Sitemap %s does not pass validation using %s schema',
            $this->sitemapXml->documentURI,
            $sitemapSchemaFile
        )
    );
}

The "marcortola/behat-seo-contexts" extension also provides the following sitemap validation steps:

Then the sitemap URLs should be alive
Then the multilanguage sitemap should pass Google validation
Then /^the sitemap should have ([0-9]+) children$/

Schema.org Markup

Schema.org is a collaborative effort between Google, Bing, Yandex, and Yahoo to create structured data markup. Adding this markup gives search engines information about the page and enhances how it appears in rich results.

Sample Scenario:

Given I am on homepage

Then the page HTML markup should be valid

HTMLContext method:

/**
 * Validates the current page's HTML with the first validation service that responds.
 */
public function thePageHtmlMarkupShouldBeValid(): void
{
    $validated        = false;
    $validationErrors = [];

    // Try each validation service in turn and stop at the first one that answers.
    foreach (self::VALIDATION_SERVICES as $validatorService) {
        try {
            $validator        = new Validator($validatorService);
            $validatorResult  = $validator->validateDocument($this->getSession()->getPage()->getContent());
            $validated        = true;
            $validationErrors = $validatorResult->getErrors();
            break;
        } catch (ServerException | UnknownParserException $e) {
            // @ignoreException
        }
    }

    // No service was reachable, so the step is marked as pending instead of failed.
    if (!$validated) {
        throw new PendingException('HTML validation services are not working');
    }

    // Report the first validation error, if any.
    if (isset($validationErrors[0])) {
        throw new InvalidArgumentException(
            sprintf(
                'HTML markup validation error: Line %s: "%s" - %s in %s',
                $validationErrors[0]->getFirstLine(),
                $validationErrors[0]->getExtract(),
                $validationErrors[0]->getText(),
                $this->getCurrentUrl()
            )
        );
    }
}

HTTP Status code

An HTTP status code is a three-digit code that a server sends in response to a browser's request.

Common status code classes:

1xxs – Informational responses
2xxs – Success!
3xxs – Redirection
4xxs – Client errors. It is a good practice to return a 404 error page when the correct URL is not found.
5xxs – Server errors

We can use the existing step - “Then the response status code should be 301”, to validate HTTP response code.
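Because the first digit of a status code identifies its class, a small helper can group codes before asserting on them. This is a minimal sketch; the function name `statusCodeClass` is hypothetical and not part of Behat or the SEO extension.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: maps an HTTP status code to the class it belongs to.
 */
function statusCodeClass(int $statusCode): string
{
    if ($statusCode < 100 || $statusCode > 599) {
        throw new InvalidArgumentException(sprintf('%d is not a valid HTTP status code', $statusCode));
    }

    $classes = [
        1 => 'Informational',
        2 => 'Success',
        3 => 'Redirection',
        4 => 'Client error',
        5 => 'Server error',
    ];

    // The first digit identifies the class, for example 301 -> 3 -> Redirection.
    return $classes[intdiv($statusCode, 100)];
}

echo statusCodeClass(301), PHP_EOL;
echo statusCodeClass(404), PHP_EOL;
```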

Hreflang Tag

The hreflang attribute helps search engines show the correct version of a page based on a user's location and language preferences.

Format: <link rel="alternate" href="http://example.com" hreflang="en-us" />

rel="alternate" - Indicates that the content exists in alternate language(s).
href - Specifies the URL of the content.
hreflang="x" or hreflang="x-default" - Shows the relationship between web pages in the alternate languages.
The format of "x" is a language code or a language-country code. It is used when a page exists in a particular language, for example hreflang="es" or hreflang="es-mx". (Note: The language code always comes before the country code.)
hreflang="x-default" is used when there is no language/region match for a page.
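The value format described above can be checked with a short pattern. This sketch only tests the shape of the value (two-letter language, optional two-letter country, or x-default); it does not verify that the codes are real ISO codes, so a swapped value such as "mx-es" would still pass. The function name `isWellFormedHreflang` is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Shape check only: "xx", "xx-yy" or "x-default" (case-insensitive).
 */
function isWellFormedHreflang(string $value): bool
{
    return preg_match('/\A(?:x-default|[a-z]{2}(?:-[a-z]{2})?)\z/i', $value) === 1;
}

foreach (['es', 'es-mx', 'x-default', 'mx-', 'spanish'] as $value) {
    printf("%-10s %s\n", $value, isWellFormedHreflang($value) ? 'ok' : 'invalid');
}
```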

One of the rules is that hreflang tags must be bidirectional (reciprocal): when an English page links to a Spanish page, the Spanish page must link back to the English page.
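The reciprocity rule can be sketched as a check over a map of pages and their alternates. The data structure (page URL mapped to its hreflang links) and the function name `findMissingReturnLinks` are assumptions made for illustration; a real test would first collect the links from each page.

```php
<?php
declare(strict_types=1);

/**
 * @param array<string, array<string, string>> $pages page URL => [hreflang => alternate URL]
 *
 * @return string[] one message per alternate that does not link back
 */
function findMissingReturnLinks(array $pages): array
{
    $missing = [];
    foreach ($pages as $pageUrl => $alternates) {
        foreach ($alternates as $alternateUrl) {
            if ($alternateUrl === $pageUrl) {
                continue; // a self-reference needs no return link
            }
            // The alternate page must list the current page among its own alternates.
            $returnLinks = $pages[$alternateUrl] ?? [];
            if (!in_array($pageUrl, $returnLinks, true)) {
                $missing[] = sprintf('%s does not link back to %s', $alternateUrl, $pageUrl);
            }
        }
    }

    return $missing;
}

$pages = [
    'https://example.com/en' => ['en' => 'https://example.com/en', 'es' => 'https://example.com/es'],
    'https://example.com/es' => ['es' => 'https://example.com/es'], // missing the link back to /en
];

print_r(findMissingReturnLinks($pages));
```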

Sample Scenario:

Given I am on "/valid-hreflang.html"

Then the page hreflang markup should be valid

LocalizationContext method:

/**
 * Runs every hreflang check against the current page.
 */
public function thePageHreflangMarkupShouldBeValid(): void
{
    $this->assertHreflangExists();
    $this->assertHreflangValidSelfReference();
    $this->assertHreflangValidIsoCodes();
    $this->assertHreflangCoherentXDefault();
    $this->assertHreflangValidReciprocal();
}

/**
 * Fails when the page has no hreflang link elements at all.
 */
private function assertHreflangExists(): void
{
    Assert::notEmpty(
        $this->getHreflangElements(),
        sprintf('No hreflang meta tags have been found in %s', $this->getCurrentUrl())
    );
}

/**
 * Finds the <link rel="alternate" hreflang="..."> elements in the page head.
 */
private function getHreflangElements(): array
{
    return $this->getSession()->getPage()->findAll(
        'xpath',
        '//head/link[@rel="alternate" and @hreflang]'
    );
}

Page Speed

Page speed (page load time) measures the time taken to fully load the content of a page.

Some ways to improve page speed:

Minify the CSS, JavaScript, and HTML files of the website
Leverage browser caching for images, CSS, and JS
Minimize Redirects
Optimize images

The extension provides a performance context that covers testing HTML minification, CSS minification, JS minification, browser caching, and async or deferred JS loading. The sample scenario below tests HTML minification, and the method snippet shows the check for CSS/JS minification.

Sample Scenario:

Given I am on "/performance/html/minified.html"

Then HTML code should be minified

PerformanceContext method:

/**
 * Asserts that every self-hosted CSS or JavaScript file on the page is minified.
 */
public function cssOrJavascriptFilesShouldBeMinified(string $resourceType): void
{
    $this->doesNotSupportDriver(KernelDriver::class);

    $resourceType = 'Javascript' === $resourceType ? 'js' : 'css';

    foreach ($this->getSelfHostedPageResources($resourceType) as $element) {
        // Open the resource so that its content can be read from the session.
        if ($url = $this->getResourceUrl($element, $resourceType)) {
            $this->getSession()->visit($url);
        }

        // Compare the served content with its minimized version.
        $content = $this->getSession()->getPage()->getContent();
        $this->assertContentIsMinified(
            $content,
            'js' === $resourceType ? $this->minimizeJs($content) : $this->minimizeCss($content)
        );

        // Return to the page under test before checking the next resource.
        $this->getSession()->back();
    }
}
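The idea of comparing content with its minimized version can be illustrated with a rough heuristic for CSS. This is not the extension's implementation: the naive minifier, the function names, and the 10% savings threshold are all assumptions chosen for the sketch.

```php
<?php
declare(strict_types=1);

/**
 * Naive CSS minifier, used only to measure how much whitespace and comments could be saved.
 */
function naiveMinifyCss(string $css): string
{
    $css = (string) preg_replace('#/\*.*?\*/#s', '', $css);        // remove comments
    $css = (string) preg_replace('/\s+/', ' ', $css);              // collapse whitespace
    $css = (string) preg_replace('/\s*([{}:;,])\s*/', '$1', $css); // trim around punctuation

    return trim($css);
}

/**
 * Treats content as minified when minifying it again saves at most $maxSavings (a ratio).
 */
function looksMinified(string $css, float $maxSavings = 0.1): bool
{
    if ($css === '') {
        return true;
    }
    $savings = 1 - strlen(naiveMinifyCss($css)) / strlen($css);

    return $savings <= $maxSavings;
}

$pretty   = "body {\n    color: red;\n    margin: 0;\n}\n";
$minified = 'body{color:red;margin:0;}';

var_dump(looksMinified($pretty), looksMinified($minified));
```

A ratio is used instead of strict equality so that a file produced by a different minifier is not rejected over a few bytes of difference.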

URL Optimization

A URL (Uniform Resource Locator) is human-readable text that specifies the location of a web page on the internet. A URL has the following basic format: protocol://domain-name.top-level-domain/path

Protocol - The protocol determines how to communicate data between the server and a web browser when sending/retrieving resources. HTTP and HTTPS (secure) are two of the most common protocols.
Domain-name - It is a unique identifier or name of the website.
Top-Level Domain (TLD) - It is an extension to the domain name. For example, .com, .net, .edu, .org, etc.
Path - Path is the exact location of the page/file on the website. The path includes specific folders and/or subfolders where the resource is located.
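The parts listed above can be extracted with PHP's built-in `parse_url()`. The sketch below is illustrative: the function name `describeUrl` is hypothetical, and the TLD is taken as the last label of the host, which is a simplification (multi-part suffixes such as "co.uk" would need a public suffix list).

```php
<?php
declare(strict_types=1);

/**
 * Splits a URL into the parts described above.
 */
function describeUrl(string $url): array
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException(sprintf('"%s" is not an absolute URL', $url));
    }

    $labels = explode('.', $parts['host']);

    return [
        'protocol' => $parts['scheme'],
        'host'     => $parts['host'],
        'tld'      => end($labels),          // last label only, see the note above
        'path'     => $parts['path'] ?? '/',
    ];
}

print_r(describeUrl('https://www.example.com/seo/meta-tags'));
```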

SEO best practices for URL optimization:

Make the URL readable for humans and search engines
Match the URL with the page title and heading
Use relevant page keywords in the URL
Remove dynamic parameters from the URL

Good URL - https://www.example.com/seo/meta-tags

Bad URL - https://www.example.com/seo?id=54321
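Two of these practices lend themselves to an automated check: flagging dynamic parameters and flagging paths that are hard to read. The sketch below is a heuristic under stated assumptions; the function name `findUrlIssues` is hypothetical, and the readability rule (lowercase letters, digits, hyphens, dots, and slashes only) is one possible interpretation of "readable", not a rule from the Behat extension.

```php
<?php
declare(strict_types=1);

/**
 * Returns a list of SEO issues found in the URL (an empty list means none were found).
 */
function findUrlIssues(string $url): array
{
    $parts = parse_url($url);
    if ($parts === false) {
        return ['URL cannot be parsed'];
    }

    $issues = [];
    if (($parts['query'] ?? '') !== '') {
        $issues[] = 'contains dynamic parameters';
    }
    // Assumed readability rule: lowercase words separated by hyphens.
    if (preg_match('#\A[a-z0-9/.-]*\z#', $parts['path'] ?? '') !== 1) {
        $issues[] = 'path is not lowercase and hyphen-separated';
    }

    return $issues;
}

print_r(findUrlIssues('https://www.example.com/seo/meta-tags'));
print_r(findUrlIssues('https://www.example.com/seo?id=54321'));
```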

We have covered the major SEO factors that affect a site's ranking in the SERP and how to automate testing them using Behat. We hope this post was informative!

If you'd like to automate the SEO of your site, reach out to business@qed42.com.

Happy Automation!!!
