1. Sai Krishna k
  2. tmp_parser



link_parser.py extracts the URLs found on a page. The out1.json file is the output I got by running it against www.google.com for 5 levels (i.e. down to the 5th level of the hierarchy). Try it on a big site for 5 levels and the program will practically never end ( :P )

get_page.py will download the entire page and rewrite all third-party URLs as relative URLs. It assumes that a local server is running, so just double-clicking the downloaded file won't work.
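The URL-rewriting idea can be sketched as below. This is only an illustration of the technique, not the actual logic in get_page.py; the helper name and the "assets" directory layout are assumptions.

```python
from urllib.parse import urlparse

def localize_url(url, local_dir="assets"):
    """Map an absolute third-party URL to a local relative path.

    Hypothetical helper -- get_page.py's real rewriting may differ.
    """
    parsed = urlparse(url)
    if not parsed.netloc:          # already relative -> leave untouched
        return url
    # e.g. https://cdn.example.com/js/app.js -> assets/cdn.example.com/js/app.js
    return f"{local_dir}/{parsed.netloc}{parsed.path}"

print(localize_url("https://cdn.example.com/js/app.js"))
# -> assets/cdn.example.com/js/app.js
print(localize_url("img/logo.png"))
# -> img/logo.png   (relative URLs are kept as-is)
```

Paths rewritten this way resolve only relative to a served document root, which is why a local server is needed.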

Usage: python get_page.py (make sure to execute the script in a separate folder).

Use python link_parser.py to get all the links on the page. You can use an online JSON viewer such as http://jsonviewer.stack.hu/ to inspect the result.

Run cat out.json | python -m json.tool > pretty_out.json for pretty-printed output.



link_parser.py first creates a flat array of this type:



{{{
    main_array = [
           ["some_string1", -1, 1],
           ["some_string2", -1, 1],
           ...
    ]
}}}

This means that string0's page contains the URLs string3, string4, and string5, and string5 in turn has URLs of its own.

The second element of each entry is the index of its parent; -1 means the entry has no parent (it is a root).
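Because each entry records its parent's index, the flat array is self-describing: you can walk from any entry back up to its root. A minimal sketch, assuming entries of the form [url_string, parent_index, ...] with the list position as the implicit index (the sample data and function name are illustrative, not taken from link_parser.py; the third field is simply carried along, since its meaning isn't documented here):

```python
# Each entry: [url_string, parent_index, ...]; position = implicit index.
main_array = [
    ["some_string0", -1, 1],   # index 0, a root
    ["some_string1", -1, 1],   # index 1, another root
    ["some_string3",  0, 1],   # index 2 here; found on some_string0's page
    ["some_string4",  0, 1],
]

def chain_to_root(arr, i):
    """Follow parent indices from entry i back up to its root."""
    chain = [arr[i][0]]
    while arr[i][1] != -1:
        i = arr[i][1]
        chain.append(arr[i][0])
    return chain

print(chain_to_root(main_array, 2))  # ['some_string3', 'some_string0']
```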

This flat array is in turn converted into a nested form, where each node looks like [index, parent, url_string, children]:

{{{


[[0, -1, 'some_string0',
  [[3, 0, 'some_string3', []],
   [4, 0, 'some_string4', []],
   [5, 0, 'some_string5',
    [[11, 5, 'some_string11',
      [[12, 11, 'some_string12', []],
       [13, 11, 'some_string13', []],
       [14, 11, 'some_string14', []]]]]]]],
 [1, -1, 'some_string1',
  [[6, 1, 'some_string6', []],
   [7, 1, 'some_string7', [[10, 7, 'some_string10', []]]]]],
 [2, -1, 'some_string2', [[8, 2, 'some_string8', []]]]]

}}}

This nested form is the final output, and it is stored in the out.txt file.
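The flat-to-nested conversion can be sketched as follows. This is an illustration of the idea, not the code in link_parser.py; it assumes flat entries of the form [url_string, parent_index, ...] with the list position as the implicit index:

```python
def to_tree(flat):
    """Convert [url, parent, ...] entries into [index, parent, url, children] nodes."""
    nodes = [[i, parent, s, []] for i, (s, parent, *_) in enumerate(flat)]
    roots = []
    for node in nodes:
        parent = node[1]
        if parent == -1:
            roots.append(node)               # no parent -> top-level node
        else:
            nodes[parent][3].append(node)    # attach to parent's child list
    return roots

flat = [
    ["some_string0", -1, 1],
    ["some_string1", -1, 1],
    ["some_string3",  0, 1],
    ["some_string4",  0, 1],
]
print(to_tree(flat))
# [[0, -1, 'some_string0', [[2, 0, 'some_string3', []],
#                           [3, 0, 'some_string4', []]]],
#  [1, -1, 'some_string1', []]]
```

A single pass suffices because every node is created up front, so a child can be attached even if its parent entry appears later in the flat array.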

The output file is generated even if you press Ctrl+C: all the URLs collected up to that point are formatted and written out.
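Writing partial output on Ctrl+C usually comes down to catching KeyboardInterrupt and saving in a finally block. A hedged sketch of that pattern (the function, its arguments, and the stand-in crawl step are illustrative, not the actual script):

```python
import json

def crawl(urls, out_path="out.json"):
    """Collect an entry per URL; on Ctrl+C, still write whatever was gathered."""
    collected = []
    try:
        for url in urls:
            collected.append([url, -1, 1])   # stand-in for the real crawl work
    except KeyboardInterrupt:
        pass          # swallow Ctrl+C so we fall through to the save below
    finally:
        with open(out_path, "w") as f:       # runs whether or not we finished
            json.dump(collected, f)
    return collected
```

Interrupting at any point still produces a valid JSON file containing everything collected so far.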

You can view the formatted result in IPython: paste the contents of out.txt into a string variable (for example with %paste a), run import ast, and then parse it with x = ast.literal_eval(a).

Now x[0] gives you the first root URL together with all of its children. Each child has the same format as x[0], recursively, so the whole hierarchy can be walked the same way.
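Walking the nested structure back out can be sketched like this, assuming each node has the [index, parent, url, children] shape shown above (the walk function and the inline sample literal are illustrative):

```python
import ast

def walk(node, depth=0):
    """Yield (depth, url) for a node and all of its descendants."""
    index, parent, url, children = node
    yield depth, url
    for child in children:
        yield from walk(child, depth + 1)

# e.g. after pasting out.txt's contents in place of this sample literal
tree = ast.literal_eval("[[0, -1, 'some_string0', [[3, 0, 'some_string3', []]]]]")
for depth, url in walk(tree[0]):
    print("  " * depth + url)
# some_string0
#   some_string3
```

ast.literal_eval is the safe way to load the Python-literal output; it evaluates only literals, never arbitrary code.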