This post was co-authored by Manikam Muthiah, Sr. Principal Software Engineer

Web scraping attacks are a persistent, industry-wide menace. Bad actors use automated bots to extract and copy data from websites without the business owner’s consent, then use that data for malicious purposes.

There are a few different types of web scraping. Price scraping is among the most common and often affects e-commerce, retail, and travel/hospitality websites. And it isn’t just malicious actors who do price scraping; it’s also competitors looking to undercut you on price. They scrape your catalog and use that information to match or beat your prices. Another example is image scraping, where attackers use scrapers to steal your exclusive, copyrighted images, which can damage your brand.

An attacker could also scrape the entire content of your website and use the data to create a mirror site to divert your traffic. This can lead to a loss of business for you, as well as additional attacks on users when they land on the malicious mirror site.

Bots can be used to scrape anything on the Internet — documents, images, HTML and CSS code, and more. But not all bots are bad. Some, such as Google bots, are vital and are used for marketing a business’s products and services. It’s critical to be able to distinguish the bad bots from the good ones.

How Does Citrix Application Delivery and Security Help?

In my last blog post, we looked at how Citrix is leveraging machine learning and behavior-based analytics to detect complex and sophisticated bot attacks against apps and APIs.

We are constantly updating and adding new machine learning models to our cloud-delivered Citrix Application Delivery Management (ADM) service. Now, by combining multiple behavioral features, we can detect website scraping violations.

From a behavioral point of view, scrapers are quite different from legitimate clients. Legitimate clients typically access a website to complete a certain workflow or to accomplish a task. For example, if you were looking for a pair of sneakers on an e-commerce site, you’d probably explore a few of the popular options and search based on your needs. You wouldn’t look at every single pair of sneakers on the site. A price scraper would likely scrape them all.

Scrapers’ access patterns also tend to differ in their temporal characteristics and in the kinds of website content they request, such as text, HTML, and scripts. For example, a price scraper is probably not interested in your fonts, JavaScript, or images. Scrapers also tend to generate far more requests than legitimate clients, depending on the number of products they scrape. People may also write simple bots to keep track of a few bookmarked items; these tend to repeat the exact same behavior over and over again.
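Citrix doesn’t publish the exact features its models use, but the intuition is easy to illustrate. The sketch below is a toy Python example with made-up log records and field names. It aggregates a few of the signals described above (unique pages visited, the share of static assets versus HTML, and raw request volume) per client IP; it illustrates the behavioral idea, not the Citrix ADM detection logic.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Toy access-log records: (client_ip, url, content_type).
# Purely illustrative; these are not the features Citrix ADM computes.
STATIC_TYPES = {"font", "css", "javascript", "image"}

def behavioral_features(records):
    """Aggregate simple per-client behavioral signals from access logs."""
    per_client = defaultdict(lambda: {"total": 0, "static": 0, "paths": set()})
    for client_ip, url, content_type in records:
        stats = per_client[client_ip]
        stats["total"] += 1
        if content_type in STATIC_TYPES:
            stats["static"] += 1
        stats["paths"].add(urlparse(url).path)

    features = {}
    for client_ip, stats in per_client.items():
        features[client_ip] = {
            # Scrapers often touch a large share of a site's product pages...
            "unique_paths": len(stats["paths"]),
            # ...while skipping fonts, scripts, and images almost entirely.
            "static_ratio": stats["static"] / stats["total"],
            # Raw request volume is another coarse signal.
            "request_count": stats["total"],
        }
    return features

if __name__ == "__main__":
    sample = [
        ("198.51.100.9", "/product/1", "html"),
        ("198.51.100.9", "/product/2", "html"),
        ("198.51.100.9", "/product/3", "html"),
        ("203.0.113.24", "/product/1", "html"),
        ("203.0.113.24", "/style.css", "css"),
        ("203.0.113.24", "/logo.png", "image"),
    ]
    for ip, feats in behavioral_features(sample).items():
        print(ip, feats)
```

In practice, a detection model would combine many more signals, including the temporal patterns and behavioral repetition mentioned above, and would typically score them per session rather than per IP.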

With our bot insights, we can detect scraping and provide additional details on bot type and category. Based on the data, admins can take action using Citrix’s bot management solution.

Application and SecOps admins can now view website scraping violations by going to the revamped security violations page in Citrix ADM service, which has a violation view with an “app-first” focus.

To view violations, click on Analytics → Security → Security violations in the Citrix ADM service. Select the application you want to view, then select the Bot Violation tab. You can also view the violations under the All Violation tab, under Security Violations.

The timeline graph below shows the number of scraping violations detected across a specific time period.

Fig 1: Content scraping violation on Citrix ADM service

Admins can drill down further into each detection to get details on the client IP address, session ID, bot type, bot category, total HTML requests, and much more. The bot type and bot category information are both available in the Bot Insights feature. Admins can use this information to troubleshoot, enrich the information collected from other security systems, and improve their decision making.

Fig 2: Drill down view for a content scraper violation

Expanding a violation further gives you details about the URL count, the user agent used by the client, and the empty-referrer count.

Fig 3: Client details
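To make the shape of a single detection concrete, here is a hypothetical record that bundles the fields called out above. The field names and sample values are ours, for illustration only; they do not correspond to a Citrix ADM API or export schema.

```python
from dataclasses import dataclass

# Hypothetical drill-down record; field names are illustrative and do not
# correspond to an actual Citrix ADM API or export format.
@dataclass
class ScrapingViolation:
    client_ip: str
    session_id: str
    bot_type: str            # e.g. a scripted HTTP client
    bot_category: str        # e.g. "content scraper"
    total_html_requests: int
    url_count: int
    user_agent: str
    empty_referrer_count: int

violation = ScrapingViolation(
    client_ip="203.0.113.24",
    session_id="a1b2c3d4",
    bot_type="script",
    bot_category="content scraper",
    total_html_requests=1842,
    url_count=1790,
    user_agent="python-requests/2.28.0",
    empty_referrer_count=1785,
)
print(violation)
```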

Admins can also use Citrix’s bot management solution to take action on the scrapers. These actions can include dropping connections, rate limiting, configuring CAPTCHAs, and more. Learn more in our product documentation, and look for additional updates as we continue to enhance our application security analytics and use machine learning to identify sophisticated violations.
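For readers less familiar with these mitigation techniques, the sketch below shows the general idea behind per-client rate limiting using a simple token bucket. It is a generic Python illustration and is unrelated to how Citrix bot management is configured; in the product, these actions are set up through policy configuration (see the documentation).

```python
import time
from collections import defaultdict

# Generic token-bucket rate limiter, shown only to illustrate the
# "rate limiting" mitigation concept. This is NOT Citrix configuration.
class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec                       # token refill rate
        self.burst = burst                             # maximum bucket size
        self.tokens = defaultdict(lambda: float(burst))
        self.last_seen = defaultdict(time.monotonic)

    def allow(self, client_ip: str) -> bool:
        """Return True if this client's request fits within its rate budget."""
        now = time.monotonic()
        elapsed = now - self.last_seen[client_ip]
        self.last_seen[client_ip] = now
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens[client_ip] = min(
            self.burst, self.tokens[client_ip] + elapsed * self.rate
        )
        if self.tokens[client_ip] >= 1.0:
            self.tokens[client_ip] -= 1.0
            return True
        # Over budget: a real deployment might drop the connection,
        # delay the response, or serve a CAPTCHA challenge instead.
        return False

limiter = TokenBucket(rate_per_sec=5, burst=10)
print([limiter.allow("203.0.113.24") for _ in range(15)])
```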

Contact a Citrix sales expert for any questions, comments or feedback, or share in the comments below. You can also learn more here.