Specifying the data to retrieve from a Hadoop system

To specify which rows to retrieve from a Hadoop system running Hive, you write a query in HQL. As mentioned earlier, HQL is similar to SQL and supports many of the same keywords, such as SELECT, WHERE, GROUP BY, ORDER BY, JOIN, and UNION.

Hive transforms HQL statements into MapReduce jobs, which Hadoop runs in parallel across the servers in a cluster. You can embed your own MapReduce scripts in a query by using the TRANSFORM clause. To make these scripts available to Hadoop, specify them in the Add File property when you configure the connection properties, as described in the previous section.

The following is an example of an HQL query that uses the TRANSFORM clause:

SELECT TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data
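
The query above names a script, weekday_mapper.py, whose contents are not shown in this section. The following is a minimal sketch of what such a script might look like, not the actual file. It relies on Hive's default TRANSFORM behavior of passing each row to the script on standard input as tab-separated text and reading tab-separated rows back from standard output. The function name map_line and the choice to interpret unixtime as seconds since the epoch in UTC are assumptions made for illustration.

```python
import sys
from datetime import datetime, timezone


def map_line(line):
    """Turn one tab-separated input row into one tab-separated output row."""
    # Input columns match the TRANSFORM clause: userid, movieid, rating, unixtime.
    userid, movieid, rating, unixtime = line.rstrip("\n").split("\t")
    # Assumption: unixtime is seconds since the epoch, interpreted as UTC.
    # isoweekday() returns 1 for Monday through 7 for Sunday.
    moment = datetime.fromtimestamp(float(unixtime), tz=timezone.utc)
    weekday = moment.isoweekday()
    # Output columns match the AS clause: userid, movieid, rating, weekday.
    return "\t".join([userid, movieid, rating, str(weekday)])


def main():
    # Hive streams rows on stdin and collects the script's stdout as result rows.
    for line in sys.stdin:
        if line.strip():
            print(map_line(line))


if __name__ == "__main__":
    main()
```

Keeping the per-row logic in a separate function makes it possible to check the script locally, for example by piping a small tab-separated file into it, before making it available to Hadoop.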