Data science Software Course Training in Ameerpet Hyderabad

Monday 8 May 2017

Pig : load Operator


Load Operator:
--------------
 Loads data from a file into a relation.
 [cloudera@quickstart ~]$ cat > samp1
100 200 300
400 500 900
100 120 23
123 900 800
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal samp1  piglab
[cloudera@quickstart ~]$

grunt> s1 = load 'piglab/samp1' using PigStorage('\t')
>>          as (a:int, b:int, c:int);
grunt> s2 = load 'piglab/samp1' using PigStorage()
>>          as (a:int, b:int, c:int);
grunt> s3 = load 'piglab/samp1'
>>          as (a:int, b:int, c:int);
grunt> dump s3
(100,200,300)
(400,500,900)
(100,120,23)
(123,900,800)
grunt> dump s2
(100,200,300)
(400,500,900)
(100,120,23)
(123,900,800)
grunt> dump s1
(100,200,300)
(400,500,900)
(100,120,23)
(123,900,800)
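The outputs of s1, s2, and s3 are the same.
 In s2, PigStorage() is called without arguments, so it uses its default delimiter, \t.
 In s3, no loader is named. Of the available loaders (such as PigStorage() and BinStorage()),
   PigStorage() is applied by default, with the \t delimiter.

 So s1, s2, and s3 mean the same thing.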
grunt> s4 = load 'piglab/samp1'
>>       as (a:int, b:int, c:int, d:int);
 The first 3 fields of the file are mapped to a, b, and c.
   The file has no 4th field,
  so d becomes null.
grunt> dump s4
(100,200,300,)
(400,500,900,)
(100,120,23,)
(123,900,800,)
-- The following skips the trailing fields: list only the leading fields in the schema.
grunt> s5 = load 'piglab/samp1'
>>    as (a:int, b:int)
>> ;
grunt> illustrate s5
--------------------------------
| s5     | a:int    | b:int    |
--------------------------------
|        | 100      | 120      |
--------------------------------
-- To skip middle fields, use the foreach operator. [later]
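
The positional mapping seen in s4 and s5 can be modeled in a few lines of Python. This is only a simplified sketch, not Pig's implementation: the function name map_to_schema is made up for illustration, and it assumes every field is declared as int.

```python
# Simplified model of how a line is matched to a schema by position.
# map_to_schema is a hypothetical helper, not part of Pig.

def map_to_schema(line, field_names, delimiter="\t"):
    """Split one line and fit the values to the schema length."""
    values = line.split(delimiter)
    # Schema longer than the line: pad with None (stands in for Pig's null).
    values += [None] * (len(field_names) - len(values))
    # Schema shorter than the line: drop the trailing values.
    values = values[:len(field_names)]
    # All fields are assumed to be ints in this sketch.
    return tuple(int(v) if v is not None else None for v in values)

row = "100\t200\t300"
s4_row = map_to_schema(row, ["a", "b", "c", "d"])  # like s4: no 4th field, so d is None
s5_row = map_to_schema(row, ["a", "b"])            # like s5: c is dropped
print(s4_row)  # (100, 200, 300, None)
print(s5_row)  # (100, 200)
```

Only trailing fields can be dropped this way, because the mapping is strictly by position.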
Loading non-tab-delimited files into a Pig relation:

[cloudera@quickstart ~]$ cat > samp2
100,10,1
2,200,20
3,30,300
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal samp2 piglab
grunt> ss1 = load 'piglab/samp2' as (a:int, b:int, c:int);
grunt> dump ss1
(,,)
(,,)
(,,)
 Here load expects the \t delimiter,
  but the file has no tabs,
   so the entire line is treated as one field, which is a string.
  That string is mapped to the first field of the relation, a, which is declared as int.
  The cast fails, so a becomes null. The file has no 2nd or 3rd field, which is why b and c also become null.
grunt> ss2 = load 'piglab/samp2'
>>       as (a:chararray, b:int, c:int);
grunt> dump ss2
(100,10,1,,)
(2,200,20,,)
(3,30,300,,)
 Here a is a chararray, so it holds the entire line as a string; b and c are still null.
grunt> ss3 = load 'piglab/samp2'
>>      using PigStorage(',')
>>      as (a:int, b:int, c:int);
grunt> dump ss3
(100,10,1)
(2,200,20)
(3,30,300)
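
The three loads of samp2 (ss1, ss2, ss3) can be modeled in Python. This is a simplified sketch under assumptions, not Pig's actual code: load_line and to_int are hypothetical helper names, None stands in for null, and only the int and chararray types are handled.

```python
# Simplified model of delimiter splitting plus type casting during load.
# load_line and to_int are hypothetical helpers, not part of Pig.

def to_int(text):
    """Return int(text), or None when text is missing or not a valid integer."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None

def load_line(line, types, delimiter="\t"):
    """Split a line, pad missing fields with None, and cast by declared type."""
    values = line.split(delimiter)
    values += [None] * (len(types) - len(values))
    row = []
    for value, declared in zip(values, types):
        # chararray fields keep the raw string; int fields are cast.
        row.append(to_int(value) if declared == "int" else value)
    return tuple(row)

line = "100,10,1"
ss1_row = load_line(line, ["int", "int", "int"])                 # default \t delimiter
ss2_row = load_line(line, ["chararray", "int", "int"])           # default \t delimiter
ss3_row = load_line(line, ["int", "int", "int"], delimiter=",")  # matching delimiter
print(ss1_row)  # (None, None, None)
print(ss2_row)  # ('100,10,1', None, None)
print(ss3_row)  # (100, 10, 1)
```

With the wrong delimiter the whole line lands in the first field, so only a delimiter that matches the file gives three usable fields.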
grunt> cat piglab/emp
101,aaaa,40000,m,11
102,bbbbbb,50000,f,12
103,cccc,50000,m,12
104,dd,90000,f,13
105,ee,10000,m,12
106,dkd,40000,m,12
107,sdkfj,80000,f,13
108,iiii,50000,m,11
grunt> emp = load 'piglab/emp'
>>     using PigStorage(',')
>>    as (id:int, name:chararray, sal:int, sex:chararray, dno:int);
grunt> illustrate emp
----------------------------------------------------------------------------------------
| emp     | id:int    | name:chararray    | sal:int    | sex:chararray    | dno:int    |
----------------------------------------------------------------------------------------
|         | 104       | dd                | 90000      | f                | 13         |
----------------------------------------------------------------------------------------

2 comments:

  1. I/P
    -------------
    (1,2,3),(4,5,6)
    (7,8,9),(10,11,12)
    (13,14,15),(16,17,18)

    script
    ------------

    grunt> data = load 'complexDTAccess' USING PigStorage(',') as (t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
    grunt> output1 = foreach data GENERATE t1.t1a as first, t2.t2c as last;
    grunt> dump output1;

    Can you please tell me why it is not loading the data?
