Got it

[ Technical Dry Goods ] [Data Lake Case] ​​One-stop large database business development

210 0 0 0 0

Hello, everyone!

Today I'm going to introduce you DAYU


If you have already applied for big data products such as MRS, DLI, DWS, etc., in the next time, you will hope to quickly run your business on big data services. Data Lake Factory (DLF) is an online one-stop big data development platform. Assist users to quickly move to the cloud. This paper uses Data Lake Factory (DLF) and Data Lake Exploration (DLI) to analyze the raw data of movie ratings, and output the highest-rated and most active TOP 20 movies. Through this article, you can learn DLF script editing, job editing, job scheduling and other functions, as well as the basic SQL syntax of DLI.

  • Data Lake Factory provides a one-stop big data collaborative development platform to help users easily complete data modeling, data integration, script development, job scheduling, operation and maintenance monitoring and other tasks, which can greatly reduce user usage The threshold of big data helps users quickly build big data processing centers.

Data preparation The
demo data comes from https://grouplens.org/datasets/movielens/ , for the convenience of presentation, it is saved in the attachment of this article. Please upload the data in the attachment to the OBS bucket, and then DLF and DLI will directly process the data in the bucket.

Original data description
movies.csv saves basic movie information, fields: movie ID, name, genre
movie table.JPG 
ratings.csv saves movie ratings, fields: user ID, movie ID, rating, rating time, rating range: 0~5
 rating table.JPG

To create a database table,
we need to create two tables, movie and rating, to store the original data, and create top_rating_movie and top_active_movie to store the analysis results, that is, the TOP 20 movies with the highest ratings and the most active.

Create table SQL can be executed using DLI or DLF editor. Below, use DLF editor to execute SQL.
First, you need to create a DLF to DLI connection.
 Create DLI connection.JPG

Then create a SQL script in "Data Development" and open the script editor.
Script editing.JPG 
Enter the following table creation statement to create 4 tables.


create table movie(movieId int,title varchar(256),genres varchar(256)) using csv options (path's3a://obs-movies/movies.csv');
create table rating(userId int,movieId int,rating float,rating_date long) using csv options (path's3a://obs-movies/ratings.csv');
create table top_rating_movie(title varchar(256),avg_rating float,rating_user_number int);
create table top_active_movie(title varchar(256 ),avg_rating float,rating_user_number int);


Description

Movie and rating are OBS tables, data is stored on obs, top_rating_movie and top_active_movie are DLI tables, which store analysis results.

After the movie and rating tables are created successfully, you can use the SELECT statement to query the data in the obs files s3a://obs-movies/movies.csv and s3a://obs-movies/ratings.csv.
 

Analyze the data
to create SQL scripts, calculate the highest rated and most active TOP 20 movies from the movie and rating tables, and store the results in the top_rating_movie and top_active_movie tables.
Calculate the content of the top_rating_movie script

insert overwrite table top_rating_movie
select
  a.title,
  b.ratings / b.rating_user_number as avg_rating,
  b.rating_user_number

from
  movie a,
  (
    select
      movieId,
      sum(rating) ratings,
      count(1) as rating_user_number

    from
      rating
    group by
      movieId
  ) b
where
  rating_user_number> 100
  and a.movieId = b.movieId
order by
  avg_rating desc
limit
  20

This SQL first uses a sub-SELECT statement to calculate the total rating of each movie and the number of user ratings on the rating table according to movieId. The sub-SELECT language and movie are used as a join query to filter out records with a user rating less than 100, and then the movie name, average rating and user rating are returned. Finally, the result is saved to the table top_rating_movie.

Calculate the content of the top_active_movie script

insert overwrite table top_active_movie
select
  a.title,
  b.ratingSum / b.rating_user_number as avg_rating,
  b.rating_user_number

from
  movie a,
  (
    select
      movieId,
      sum(rating) ratingSum,
      count(1) as rating_user_number

    from
      rating
    group by
      movieId
  ) b
where
  avg_rating> 3.5
  and a.movieId = b.movieId
order by
  rating_user_number desc
limit
  20


The calculation method for the most active TOP20 movies is that the average rating is greater than 3.5 and the maximum number of user ratings is 20 movies.

Refresh TOP20 data regularly every day.
We assume that the movie and rating tables are changing in real time, so we hope to update TOP20 movies every day; here, DLF job scheduling and job scheduling functions can be used.
First create a DLF job
 Assignment schedule.JPG

There is a Dummy node and two DLI SQL nodes in the job. The two DLI SQL nodes are each associated with the DLI SQL scripts (top_rating_movie and top_active_movie) that calculate the highest score and the most active TOP20 movies.
 DLI SQL node configuration.JPG

After the arrangement is complete, click "Run Test" to test the job. The running log is as follows.

Run log.JPG 
If the log shows normal operation, configure the job scheduling policy.
 Scheduling strategy.JPG

Configuration strategy: execute a job at 0 o'clock every day. After the completion is complete, click Save, and then click Submit Job. This will automatically run the job every day and automatically save the TOP 20 movies to the table (top_rating_movie and top_active_movie).

The job monitoring
job is automatically run every day, and users hope to know whether the job execution result is success or failure in time. Users can understand the job results through DLF's job monitoring interface, email notifications, and SMS notifications.
 

Job monitoring.JPG


Comment

You need to log in to comment to the post Login | Register
Comment

Notice: To protect the legitimate rights and interests of you, the community, and third parties, do not release content that may bring legal risks to all parties, including but are not limited to the following:
  • Politically sensitive content
  • Content concerning pornography, gambling, and drug abuse
  • Content that may disclose or infringe upon others ' commercial secrets, intellectual properties, including trade marks, copyrights, and patents, and personal privacy
Do not share your account and password with others. All operations performed using your account will be regarded as your own actions and all consequences arising therefrom will be borne by you. For details, see " User Agreement."

My Followers

Login and enjoy all the member benefits

Login

Block
Are you sure to block this user?
Users on your blacklist cannot comment on your post,cannot mention you, cannot send you private messages.
Reminder
Please bind your phone number to obtain invitation bonus.