Overview

link_parser.py collects the URLs found in a page. The out1.json file is the output I got from running it against www.google.com for 5 levels (i.e. down to the 5th level of the link hierarchy). Try it on a big site for 5 levels and the program will practically never end ( :P ).
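As a rough illustration of the link-collection step, here is a minimal sketch that pulls the link targets out of an HTML string using only the standard library. The names `LinkCollector` and `extract_links` are made up for this example and are not taken from link_parser.py, which may well work differently. It also hints at why deep crawls blow up: if every page had, say, 50 links, five levels would already mean 50^5 (about 312 million) pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative hrefs into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in an HTML string (hypothetical helper)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


sample = '<a href="/about">About</a> <a href="http://example.org/x">X</a>'
print(extract_links(sample, "http://example.com/"))
```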

get_page.py downloads the entire page and changes all third-party URLs to relative URLs. It assumes that a local server is running, so just double-clicking the downloaded file won't work.
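One possible reading of "relative URLs" here is root-relative paths such as /js/app.js, which only resolve when the page is served over HTTP; that would explain why opening the file directly fails. The sketch below shows that rewrite under this assumption. `to_relative` is a hypothetical helper, not a function from get_page.py.

```python
from urllib.parse import urlsplit


def to_relative(url):
    """Drop scheme and host from an absolute URL, keeping path, query and fragment."""
    parts = urlsplit(url)
    relative = parts.path or "/"
    if parts.query:
        relative += "?" + parts.query
    if parts.fragment:
        relative += "#" + parts.fragment
    return relative


print(to_relative("http://cdn.example.com/js/app.js?v=1"))
```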

Usage: python get_page.py. Make sure to execute the script in a separate folder.

Use python link_parser.py to get all the links in the page. Use an online JSON viewer such as http://jsonviewer.stack.hu/ to view the links.

cat out.json | python -mjson.tool > pretty_out.json for pretty output

link_parser2.py

main2(url,universal_limit)

Creates an array of this form:

{{{

#!python

    main_array = [ 
           ["some_string0",-1,1], 
           ["some_string1",-1,1] , 
           ["some_string2",-1,1] ,
           ["some_string3",0,-1],
           ["some_string4",0,-1],
           ["some_string5",0,-1],
           ["some_string6",1,-1],
           ["some_string7",1,-1],
           ["some_string8",2,-1],
           ["some_string9",5,-1],
           ["some_string10",7,-1],
           ["some_string11",9,-1],
           ["some_string12",11,-1],
           ["some_string13",11,-1],
           ["some_string14",11,-1]
         ]

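}}}

This means that some_string0 contains the URLs some_string3, some_string4 and some_string5, and some_string5 in turn contains the URL some_string9.

The second element of each entry is the index of its parent in main_array; -1 marks a top-level entry.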

This in turn is converted into:

{{{

#!python

        [[0,
          -1,
          'some_string0',
          [[3, 0, 'some_string3', []],
           [4, 0, 'some_string4', []],
           [5,
            0,
            'some_string5',
            [[9,
              5,
              'some_string9',
              [[11,
                9,
                'some_string11',
                [[12, 11, 'some_string12'],
                 [13, 11, 'some_string13'],
                 [14, 11, 'some_string14']]]]]]]]],
         [1,
          -1,
          'some_string1',
          [[6, 1, 'some_string6', []],
           [7, 1, 'some_string7', [[10, 7, 'some_string10', []]]]]],
         [2, -1, 'some_string2', [[8, 2, 'some_string8', []]]]]

}}}

This is the output, and it is stored in the out.txt file.
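The conversion from the flat array to the nested form can be sketched roughly as follows. This is a minimal illustration, not the code from link_parser2.py: `build_tree` is a hypothetical name, the sample rows are shortened, and unlike the sample output above it always gives every entry a children list, even at the deepest level.

```python
def build_tree(main_array):
    """Convert [name, parent_index, flag] rows into nested
    [index, parent_index, name, children] lists."""
    # One node per row; the index is the row's position in main_array.
    nodes = [[i, row[1], row[0], []] for i, row in enumerate(main_array)]
    roots = []
    for node in nodes:
        parent = node[1]
        if parent == -1:
            roots.append(node)  # top-level entry
        else:
            nodes[parent][3].append(node)  # attach to its parent
    return roots


main_array = [
    ["a", -1, 1],
    ["b", -1, 1],
    ["c", 0, -1],
    ["d", 2, -1],
]
print(build_tree(main_array))
```

Because each row stores only its parent's index, a single pass over the array is enough to attach every entry to its parent.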

The output file is generated even if you press Ctrl+C; it contains all the URLs collected up to that point, in the format shown above.
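One common way to get this behaviour in Python is to catch KeyboardInterrupt and write the partial result in a finally block. The sketch below assumes that approach; `collect_with_partial_output` is a hypothetical function, and writing the result with repr is an assumption chosen so that ast.literal_eval can read the file back.

```python
def collect_with_partial_output(items, out_path):
    """Consume items and write whatever was collected, even if interrupted."""
    collected = []
    try:
        for item in items:
            collected.append(item)
    except KeyboardInterrupt:
        # Ctrl+C: stop collecting, but still fall through to the write below.
        pass
    finally:
        with open(out_path, "w") as handle:
            handle.write(repr(collected))
    return collected
```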

You can view the formatted result in IPython. Load the output into a variable a with %paste a, then use:

import ast

x=ast.literal_eval(a[0])

Now x[0] gives you the first URL's entry together with all of its children.

Each child is in the same format as x[0], including its own children if it has any.
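Since every child has the same shape as its parent, the structure can be walked recursively. The following is a small sketch with a shortened, made-up sample in place of the real x; `walk` is a hypothetical helper. It tolerates entries without a children list, as seen at the deepest level of the sample output.

```python
def walk(node, depth=0):
    """Yield (depth, url) for a node and all of its descendants."""
    yield depth, node[2]
    # The deepest entries may have no children list at all.
    children = node[3] if len(node) > 3 else []
    for child in children:
        yield from walk(child, depth + 1)


x = [[0, -1, "some_string0",
      [[3, 0, "some_string3", []],
       [5, 0, "some_string5", [[9, 5, "some_string9"]]]]]]

for depth, url in walk(x[0]):
    print("  " * depth + url)
```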