[Dr. WoW Season 2] [No 7] Classification of Massive URLs---URL Filtering Mechanism

Latest reply: May 21, 2018 17:18:39

The Internet is as vital as air and water. We can search the Internet for information, news, and videos, each of which is identified by a unique address known as a Uniform Resource Locator (URL).

However, not all URLs are good. For example, access to non-work-related, illegal, or malicious websites during working hours compromises the productivity, the confidentiality of information assets, and the network security of enterprises. To block such URLs, the NGFW provides the URL filtering function to control Internet access.

To understand how URL filtering works, we need to know what information a URL contains. Now, let's have a look at the URL format.

URL Format

Essentially, a URL is a character string, short or long, that consists of several fields, as shown in the following example:

[Figure: example URL with its protocol, host, path, and parameter fields labeled]

 

The fields are described as follows:

• Protocol: indicates the protocol, usually HTTP or HTTPS. URL filtering on Huawei NGFW applies to both protocols. Here we use HTTP as an example. HTTPS requires SSL decryption to be configured, which will be described later.

• Host: indicates the domain name or IP address of the web server. If the web server uses a non-standard port (such as 8080), the host field must also contain the port number, such as www.example.com:8080.

• Path: indicates the directory or file name on the web server, with levels separated by slashes (/).

• Parameter: indicates the parameters sent to the web page, usually used for dynamically querying data from a database.

To filter the URL, the NGFW verifies and matches this character string. Generally, the value of the parameter field is complex, meaning that the management overhead will be high if URL filtering is based on this field. Therefore, the NGFW usually filters URLs based on the host and path fields.

As for case sensitivity, the URL format specifications state that the values of the protocol and host fields are case insensitive, whereas whether the values of the path and parameter fields are case sensitive depends on the settings on the web server. Note that the URL filtering function of Huawei NGFW is case insensitive for these fields as well.
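The field breakdown above can be sketched in a few lines of Python using the standard urllib.parse module. This is only an illustration of the idea, not how the NGFW is implemented: the function names split_url and filter_key are made up for this example, and lowercasing the host and path is an assumption that mirrors the case-insensitive matching described above.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> dict:
    """Split a URL into the four fields described in this section."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme.lower(),
        "host": parts.netloc.lower(),  # keeps the port if one is present
        "path": parts.path,
        "parameter": parts.query,
    }


def filter_key(url: str) -> str:
    """Build the host + path string that a filter would compare (assumed)."""
    fields = split_url(url)
    # The parameter field is deliberately ignored, as filtering is
    # usually based on the host and path fields only.
    return (fields["host"] + fields["path"]).lower()


example = "HTTP://WWW.Example.com:8080/News/Index.html?id=1"
print(split_url(example))
print(filter_key(example))
```

Ignoring the parameter field keeps the key short and stable, which is the same reasoning the article gives for filtering on host and path.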

To facilitate the filtering of massive URLs, Huawei Security Center classifies them into categories.

Category-based URL Filtering

Predefined category

Currently, Huawei's URL category database contains more than 85 million URLs organized into 45 categories and 137 subcategories, and the numbers are still increasing. The existing categories, with the subcategories of the Download category as an example, are listed below.

Download (subcategories: E-book download, Software download, Image download, Music and movie download, Other download)

Other categories: Humanity, Sports, Social focus, Military affairs, Social network, Lottery, Leisure, Religion/Supernature, Sexual topics, Real estate/Home, Jobs, Search/Portal, Government/Politics, Education/Science, News/Media, Travel, Fashion, Vehicle, Legal, Streaming media, Life, IT-related, Forum, Shopping, Business/Economy/Finance, Food/Drink, Community, Vulgar/Horror, Gambling, Drugs, Malicious website, ***ographic website, Crime, Inactive website, Weapon, Fraud, Abortion, ***, Storage server, Illegal speech, Cult, General sites, Point-to-point (P2P), Others

 

 

These are predefined URL categories provided by default. When the NGFW receives a website access request, it compares the requested URL against the URLs in predefined categories for a match.

Comparing requested URLs against the massive number of URLs in predefined categories one by one is inefficient. Therefore, the NGFW uses a two-step process. The NGFW caches URLs that have been matched. In the first step, the NGFW compares a requested URL against these URLs for a match. If no match is found, the NGFW then uses the URL categorization service at Huawei Security Center to look for a match. The two-step search process is shown in the following figure.

[Figure: two-step URL category query process (local cache query, then remote query)]

 

1. After initial startup, the NGFW automatically loads the predefined URL category database file and extracts the most common URLs to form a database of hotspot URLs in the cache.

2. An enterprise user enters a URL in the browser address box to access the specified website on the Internet.

3. The NGFW extracts the URL from the enterprise user's access request and searches the hotspot URL database for a match. This is the first step in the two-step process and is called local cache query. If a match is found, the NGFW takes the action defined for the category. If no match is found, the NGFW proceeds to the next step.

4. The NGFW initiates a remote query to Huawei Security Service Center (sec.huawei.com). This is the second step in the two-step process. Huawei Security Service Center has many more predefined URLs. After a category match is found, the NGFW takes the action defined for the category. The exchanges between the NGFW and Huawei Security Service Center (sec.huawei.com) will be described later.

5. The NGFW adds the remote query result to the hotspot URL database. Through this learning process, the hotspot URL database is constantly updated to reflect the most frequently requested URLs of a particular enterprise, region, or country, which reduces the need for remote URL queries and accelerates URL filtering.

6. To ensure that the learning results survive an NGFW power failure or restart, the NGFW regularly saves the cached hotspot URL database to the hotspot URL database file. Upon every subsequent startup, the NGFW automatically loads this file to ensure that the hotspot URL database is up to date.
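The two-step search and the learning behavior in steps 3 to 5 can be sketched as a small cache-then-remote lookup. This is a simplified illustration under stated assumptions: the class name TwoStepLookup, the fake_remote function, and the in-memory dictionaries are all hypothetical stand-ins, and the real hotspot database and remote service are certainly more elaborate.

```python
from typing import Callable, Dict, List, Optional


class TwoStepLookup:
    """Local hotspot cache first, remote query second (illustrative only)."""

    def __init__(self, hotspot: Dict[str, str],
                 remote_query: Callable[[str], Optional[str]]) -> None:
        self.hotspot = dict(hotspot)      # hotspot URLs loaded at startup
        self.remote_query = remote_query  # stand-in for the remote service

    def category_of(self, url: str) -> Optional[str]:
        key = url.lower()
        category = self.hotspot.get(key)       # step 3: local cache query
        if category is None:
            category = self.remote_query(key)  # step 4: remote query
            if category is not None:
                self.hotspot[key] = category   # step 5: learn the result
        return category


remote_calls: List[str] = []


def fake_remote(url: str) -> Optional[str]:
    """Hypothetical remote service backed by a tiny in-memory table."""
    remote_calls.append(url)
    return {"sports.example.com": "Sports"}.get(url)


lookup = TwoStepLookup({"www.example.com": "Search/Portal"}, fake_remote)
first = lookup.category_of("sports.example.com")   # needs a remote query
second = lookup.category_of("sports.example.com")  # served from the cache
print(first, second, len(remote_calls))
```

The second request for the same URL never reaches the remote function, which is the point of the learning step: frequently requested URLs end up being answered locally.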

In remote query, Huawei Security Service Center functions only as a "point of contact". The remote query process also involves several other server roles, including the URL Category Database (UCDB) and the URL Category Searching Server (UCSS). The detailed exchange process is illustrated in the following figure.

[Figure: remote query exchange among the NGFW, Huawei Security Service Center, UCDB, and UCSS]

 

The remote query process is provided only for completeness, and the details are not discussed here. Note that Huawei Security Service Center assigns the UCDB IP address and port number to the NGFW based on the country/region in which the NGFW is deployed. Therefore, you need to configure the country/region of the NGFW in advance.

In addition, for the NGFW to communicate with Huawei Security Service Center, the UCDB, and the UCSS, we must configure security policies, as illustrated in the following table.

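Source Device | Source IP Address | Source Port | Destination Device | Destination IP Address | Destination Port | Protocol
--------------|-------------------|-------------|--------------------|------------------------|------------------|---------
NGFW | Any | Any | Huawei Security Service Center | Any | 80 | TCP
NGFW | Any | Any | UCDB | Any | 12612 | TCP
NGFW | Any | Any | UCSS | Any | 12600 | UDP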

 

Huawei Security Service Center (sec.huawei.com) has a changing IP address, and the IP addresses of the UCDB and UCSS are dynamically assigned. Therefore, we need to set the destination IP address of the three devices to Any in the security policies.

The remote query service requires a license, but it covers more URLs and keeps them more up to date. Therefore, this service is recommended to ensure satisfactory URL filtering results.

The URL filtering actions of the NGFW are described as follows:

• Allow: The NGFW allows access to the requested URL. When the action is Allow, you can also configure packet priority re-marking so that other network devices can handle URL traffic of different categories based on the modified DSCP priorities.

• Alert: The NGFW allows access to the requested URL and generates a log.

• Block: The NGFW blocks access to the requested URL, generates a log, and pushes a web page to the user that explains why the requested page cannot be accessed. The content of the pushed page can be customized.

In the real world, whether through local cache query or remote query, a URL may be found in multiple categories. If the actions for these categories are different, which action applies? Two modes are available on the NGFW:

• Strict mode: The strictest action is taken on the URL. For example, if a URL matches two categories, and the action is alert for one category and block for the other, the block action is taken on the URL.

• Loose mode: The loosest action is taken on the URL. For example, if a URL matches two categories, and the action is alert for one category and block for the other, the alert action is taken on the URL.
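The strict and loose modes amount to picking the maximum or minimum of the candidate actions on a severity scale. The sketch below assumes the ordering allow < alert < block; the article only compares alert with block, so placing allow at the loosest end is an assumption, and the names SEVERITY and resolve_action are hypothetical.

```python
from typing import Iterable

# Assumed severity order: allow is the loosest action, block the strictest.
SEVERITY = {"allow": 0, "alert": 1, "block": 2}


def resolve_action(actions: Iterable[str], mode: str = "strict") -> str:
    """Pick one action when a URL matches several categories."""
    candidates = list(actions)
    if not candidates:
        raise ValueError("at least one action is required")
    if mode not in ("strict", "loose"):
        raise ValueError("mode must be 'strict' or 'loose'")
    pick = max if mode == "strict" else min
    return pick(candidates, key=lambda action: SEVERITY[action])


print(resolve_action(["alert", "block"], mode="strict"))
print(resolve_action(["alert", "block"], mode="loose"))
```

With the two-category example from the text, strict mode yields block and loose mode yields alert.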

A large number of new URLs emerge every day on the Internet. Some URLs may not be included in the URL categories or existing categories may not suit the needs of customers. To address these issues, the NGFW supports user-defined URL categories.

User-defined URL and category

User-defined URLs and categories refer to those manually created on the NGFW by administrators.

• User-defined URL in a predefined category

Administrators can add new URLs to a predefined category. During URL filtering, the action for the predefined category applies to the user-defined URLs added to it.

• User-defined category

Administrators can create categories on the NGFW and add URLs to them. During URL filtering, the action for a user-defined category applies to all URLs in the category.

Recall that a URL typically contains three parts: host, path, and parameter. A filtering condition can be added either as a URL rule or as a host rule. If the condition is added as a URL rule, a URL is matched when any part of it matches the condition, so some URLs may be falsely matched. Therefore, decide carefully whether to add a filtering condition as a URL rule or as a host rule.

URLs in user-defined categories can be flexibly configured. You can use character strings and the wildcard (*) together. Note that the wildcard (*) can be placed before or after, not in the middle of a character string. In addition, you do not need to enter http:// or https:// when you add URLs. The following table lists the complete URL matching modes.

To... | Configure... | Example | Remarks
------|--------------|---------|--------
Match all URLs that start with the specified character string. | Character string + wildcard | www.example* | Prefix matching. URLs that match the example rule: www.example.com, www.example.com/hello.html
Match all URLs that end with the specified character string. | Wildcard + character string | *aspx | Suffix matching. URLs that match the example rule: www.example.com/hello.aspx, www.example.com/news/news.aspx, 192.168.0.1/sports/abc.aspx
Match all URLs that contain the specified character string. | Wildcard + character string + wildcard | *sport* | Keyword matching. URLs that match the example rule: sports.example.com, www.example.com/abcsportsabc/, 192.168.0.1/sports/
Match URLs whose host is exactly the specified character string. | Character string | www.example.com | Exact matching. The path of the URL is removed and only the domain name (host) is compared with the URL rule. URLs that match the example rule: www.example.com, www.example.com/news, www.example.com/news/en/. URLs that do not match the example rule: www.example.com.cn/news, www.example.org/news/www.example.com
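The four matching modes in the table can be sketched as follows. This is a minimal illustration, not the NGFW's matcher: the function names match_mode and rule_matches are hypothetical, the mode is inferred purely from where the wildcard (*) sits, and exact matching follows the table's description literally by dropping the path and comparing only the host.

```python
def match_mode(rule: str) -> str:
    """Infer the matching mode from the position of the wildcard."""
    starts = rule.startswith("*")
    ends = rule.endswith("*")
    if starts and ends:
        return "keyword"
    if ends:
        return "prefix"
    if starts:
        return "suffix"
    return "exact"


def rule_matches(rule: str, url: str) -> bool:
    """Check one rule against a URL given without http:// or https://."""
    rule, url = rule.lower(), url.lower()  # matching is case insensitive
    mode = match_mode(rule)
    text = rule.strip("*")
    if mode == "prefix":
        return url.startswith(text)
    if mode == "suffix":
        return url.endswith(text)
    if mode == "keyword":
        return text in url
    # Exact matching: remove the path and compare only the host.
    host = url.split("/", 1)[0]
    return host == text


for rule, url in [
    ("www.example*", "www.example.com/hello.html"),
    ("*aspx", "192.168.0.1/sports/abc.aspx"),
    ("*sport*", "www.example.com/abcsportsabc/"),
    ("www.example.com", "www.example.com/news/en/"),
    ("www.example.com", "www.example.com.cn/news"),
]:
    print(rule, url, rule_matches(rule, url))
```

The sample pairs are taken from the table, so the last one is the only pair that should not match.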

 

The priorities of the URL matching modes for user-defined categories are listed in descending order as follows: exact matching, suffix matching, prefix matching, and keyword matching. The action for the highest-priority URL entry applies.

For example, the URL www.example.com/news matches the following user-defined categories:

• Exact matching: www.example.com/news

• Prefix matching: www.example.com/*

• Keyword matching: *example*

Then the action for the category that contains the exact matching rule www.example.com/news prevails.

In the same matching mode, a longer matching rule has a higher priority. For example, the following two user-defined categories are prefix matching:

• Prefix matching: www.example.com/news/*

• Prefix matching: www.example.com/*

Then www.example.com/news/index.html is considered to belong to the category of www.example.com/news/*. If the lengths of the matched rules are the same, the action depends on whether the action mode is strict or loose mode.
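The priority rules above can be sketched as a ranking over the rules that have already matched a URL. The sketch assumes that "longer" means more characters in the rule string; the names MODE_PRIORITY, match_mode, and best_rules are hypothetical. If more than one rule survives, the tie would then be settled by the strict or loose action mode.

```python
from typing import List

# Lower number means higher priority, in the order stated in the text.
MODE_PRIORITY = {"exact": 0, "suffix": 1, "prefix": 2, "keyword": 3}


def match_mode(rule: str) -> str:
    """Infer the matching mode from the position of the wildcard."""
    starts, ends = rule.startswith("*"), rule.endswith("*")
    if starts and ends:
        return "keyword"
    if ends:
        return "prefix"
    if starts:
        return "suffix"
    return "exact"


def best_rules(matched_rules: List[str]) -> List[str]:
    """Return the winning rule(s) among rules that all matched one URL."""
    def rank(rule: str):
        # Mode priority first, then longer rules before shorter ones.
        return (MODE_PRIORITY[match_mode(rule)], -len(rule))

    best = min(rank(rule) for rule in matched_rules)
    return [rule for rule in matched_rules if rank(rule) == best]


print(best_rules(["www.example.com/news", "www.example.com/*", "*example*"]))
print(best_rules(["www.example.com/news/*", "www.example.com/*"]))
```

Returning a list rather than a single rule keeps the tie case visible, so the caller can apply the strict or loose mode to the remaining categories.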

The most important task for category-based URL filtering is to determine the categories of the URLs. Determining the category of a URL takes some time, whether through local cache query, remote query, or the matching modes of user-defined categories. To skip category matching and allow or block URLs outright, the NGFW also supports blacklist-based and whitelist-based URL filtering.

Blacklist/Whitelist-based URL Filtering

Compared with category-based URL filtering, blacklist-based and whitelist-based filtering is simple and direct: URLs matching a blacklist entry are blocked, and those matching a whitelist entry are allowed.

URLs in the URL blacklist and whitelist are configured using character strings and the wildcard character (*) in the same way as those in user-defined categories.

Overall Processing Flow

We have learned URL filtering based on categories, blacklist, and whitelist. Now it's time to connect the dots and see the whole picture of the URL filtering process, as shown in the following figure.

[Figure: overall URL filtering process]

 

The steps are briefly described as follows:

1. An enterprise user initiates an HTTP connection request. The request matches a security policy whose action is allow and which requires URL filtering.

2. The NGFW extracts the URL from the HTTP connection request.

3. The NGFW searches the whitelist for the requested URL. If the URL is found, the NGFW allows the request. If the URL is not found, the NGFW proceeds to the next step.

4. The NGFW searches the blacklist for the requested URL. If the URL is found, the NGFW blocks the request. If the URL is not found, the NGFW proceeds to the next step.

5. The NGFW searches the user-defined categories for the requested URL. If the URL matches a URL category, the NGFW takes the action for the category (if the URL is a user-defined URL in a predefined category, the NGFW takes the action for the predefined category). If no match is found, the NGFW proceeds to the next step.

6. The NGFW searches the hotspot URL database in the cache for the requested URL. If the URL matches a category, the NGFW takes the action for the category. If no match is found, the NGFW proceeds to the next step.

7. The NGFW checks whether remote query is available. If remote query is unavailable, the NGFW takes the default action configured by the administrator. If remote query is available, the NGFW proceeds to the next step.

8. The NGFW initiates a remote query to Huawei Security Service Center. If Huawei Security Service Center returns the corresponding category, the NGFW takes the action for the category (when Huawei Security Service Center cannot determine the category, it classifies the URL into the Others category, and the NGFW takes the action for the Others category). If the connection to Huawei Security Service Center is lost or the query times out, the NGFW takes the action for remote query timeout as configured by the administrator.
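The order of checks in steps 3 to 8 can be sketched as one function. This is a deliberately simplified illustration: the name filter_url and all parameters are hypothetical, the lists and categories are looked up by exact string instead of the wildcard matching described earlier, and a remote timeout is modeled with Python's built-in TimeoutError.

```python
from typing import Callable, Dict, Optional, Set


def filter_url(url: str,
               whitelist: Set[str],
               blacklist: Set[str],
               user_categories: Dict[str, str],
               hotspot: Dict[str, str],
               category_actions: Dict[str, str],
               remote_query: Optional[Callable[[str], str]] = None,
               default_action: str = "allow",
               timeout_action: str = "allow") -> str:
    """Return the action for a URL, following the order of the steps above."""
    if url in whitelist:                     # step 3: whitelist wins first
        return "allow"
    if url in blacklist:                     # step 4: then the blacklist
        return "block"
    category = user_categories.get(url)      # step 5: user-defined categories
    if category is None:
        category = hotspot.get(url)          # step 6: hotspot cache
    if category is not None:
        return category_actions[category]
    if remote_query is None:                 # step 7: remote query unavailable
        return default_action
    try:
        category = remote_query(url)         # step 8: remote query
    except TimeoutError:
        return timeout_action
    return category_actions[category]


actions = {"Sports": "block", "News/Media": "allow", "Others": "alert"}
config = dict(
    whitelist={"www.example.com/help"},
    blacklist={"www.example.com/help", "bad.example.com"},
    user_categories={"sports.example.com": "Sports"},
    hotspot={"news.example.com": "News/Media"},
    category_actions=actions,
)


def unknown_remote(url: str) -> str:
    return "Others"  # the remote side could not determine a category


def slow_remote(url: str) -> str:
    raise TimeoutError("remote query timed out")


print(filter_url("www.example.com/help", **config))
print(filter_url("new.example.net", remote_query=unknown_remote, **config))
print(filter_url("new.example.net", remote_query=slow_remote,
                 timeout_action="block", **config))
```

Because the whitelist is checked before the blacklist, a URL present in both lists is allowed, which is what makes a "whitelist for specific pages plus blacklist for the whole site" setup work.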

It is important to understand the overall processing flow of URL filtering in order to correctly configure user-defined categories, the blacklist, the whitelist, and their actions.

The following table lists some examples of URL filtering and the simple configuration procedure. The configuration details will be described in the next article.

Requirements | Configuration Procedure
-------------|------------------------
Allow the access to specified websites and block the access to all other websites | Whitelist + Block all categories + Default block
Block the access to specified websites and allow the access to all other websites | Blacklist + Allow all categories + Default allow
Allow the access to specified pages on a website and block the access to all other pages of the website | Whitelist (allowed pages) + Blacklist (entire website)

 

As described in the URL format section, NGFW URL filtering applies to both HTTP and HTTPS. HTTP data is not encrypted, and URL information can be extracted directly. In contrast, HTTPS data is encrypted and must be decrypted for extracting URL information. URL filtering for HTTPS is briefly described as follows.

URL Filtering for HTTPS

When a user accesses a website using HTTPS, the user establishes an SSL connection with the website first before the application-layer data is transmitted. As shown in the following figure, the application-layer data is encrypted.

[Figure: HTTPS access in which the application-layer data is encrypted over an SSL connection]

 

To extract the URL, the NGFW must decrypt the HTTPS data.

We have described the SSL connection establishment process and certificate authentication mechanisms in Season 1. For the NGFW to decrypt HTTPS data, the SSL connection must not be established directly between the client and the HTTPS server. Instead, one SSL connection is established between the client and the NGFW, and another is established between the NGFW and the HTTPS server, as shown in the following figure. In this process, the NGFW functions as a "proxy", which decrypts the HTTPS data sent from the client, extracts the URL information, and implements URL filtering. If the connection request is allowed, the NGFW encrypts the request and sends it to the HTTPS server.

[Figure: NGFW acting as an SSL proxy between the client and the HTTPS server]

 

The NGFW encrypts and decrypts HTTPS data by using SSL decryption policies. You can refer to the product documentation for the configuration of SSL decryption policies. Note that encryption and decryption degrade performance. In addition, the NGFW does not support SSL decryption for HTTPS services that require bidirectional authentication, such as e-banking applications.

By now, we have understood the URL filtering mechanism. In the next article, we will describe the URL filtering configuration details and typical configuration examples. Stay tuned.

 


wissal
MVE Created May 21, 2018 10:45:29

useful document, thanks
w1
Created May 21, 2018 17:18:39

:)