Pierre Neidhardt committed 235387c

Wikils can list prefixes.

Files changed (5)

+2013-02-09 1.3
+	* wikils: New script for listing of category and prefix pages.
+
+2013-02-08 1.2
+	* Documentation fixes.
+
+2013-02-08 1.1
+	* Typo fixes.
+
+2013-02-07 1.0
+	* wikiex: First public release.
+
+
 Wikiex is a simple script that will retrieve wiki source files for the given
 wiki pages. Currently only MediaWiki is supported.
 
+Wikils is a sister script that parses category or prefix special pages and
+outputs one entry per line. It can be used in combination with Wikiex to batch
+download pages of specific categories or prefixes.
+
 Usage
 =====
 
-See 'wikiex -h'.
+See 'wikiex -h' and 'wikils -h'.
 
 Dependencies
 ============
 * Support of batch downloads using special wiki pages like categories or prefix.
 * Support of other wiki types.
 * Log support.
-* Bug: sometimes wikils output is empty. Seems to be an issue with wget -q option.
+* Bug: sometimes wikils output is empty. Seems to be an issue with wget -q option.
+* Wikils: subcategories support.
   -o FOLDER  Set output folder. Default is "$OUTPUT_FOLDER".
   -p         Do not overwrite existing pages.
   -r         Replace whitespaces by underscores.
-  -s LINK    Set main web site for pages without full link. Default is "$MAIN_SITE".
+  -s URI     Set main web site for pages without full link. Default is "$MAIN_SITE".
   -v         Print version.
 
 Examples:
 DATE="2013"
 VERSION="1.3"
 
-MAIN_SITE="http://en.wikibooks.org/wiki"
-CAT_PARAM="Category:"
-PREFIX_PARAM="Special:PrefixIndex/"
-PARAM="$PREFIX_PARAM"
-
 if [ "$(command -v wget)" = "" ]; then
     echo "ERROR: wget not found."
     exit
 fi
 
+MAIN_SITE="http://en.wikibooks.org/wiki"
+CAT_PARAM="Category:"
+PREFIX_PARAM="Special:PrefixIndex/"
+PARAM="$PREFIX_PARAM"
+
+PREFIX_FILTER='BEGIN {
+    RS="<|>" 
+    IN_PAGES=0
+    DONE=0
+}
+
+IN_PAGES && /title=/ {
+    gsub(/.*title="/,"")
+    gsub(/".*/,"")
+    print
+}
+
+IN_PAGES && $0 !~ "/table" && /\<table\>/ { IN_PAGES++ }
+/table id="mw-prefixindex-list-table"/    { IN_PAGES=1; DONE=1 }
+IN_PAGES && /\/table/                     { IN_PAGES-- }
+DONE && IN_PAGES==0 && /\/table/          { exit }'
+
+
+CAT_FILTER='BEGIN {
+    RS="<|>" 
+    IN_PAGES=0
+    DONE=0
+}
+
+IN_PAGES && /title=/ {
+    gsub(/.*title="/,"")
+    gsub(/".*/,"")
+    print
+}
+
+IN_PAGES && $0 !~ "/div" && /\<div\>/ { IN_PAGES++ }
+/div id="mw-pages"/                   { IN_PAGES=1; DONE=1 }
+IN_PAGES && /\/div/                   { IN_PAGES-- }
+DONE && IN_PAGES==0 && /\/div/        { exit }'
+
+FILTER="$PREFIX_FILTER"
+
 printhelp()
 {
     cat<<EOF
 Usage: $1 [OPTION] NAME
 
-NAME can be either a category or a prefix.
+This script requests a special page from a wiki, such as a category page or a
+prefix page. (A prefix means 'all pages that begin with'.) Currently this only
+works on MediaWiki sites. Once the page is fetched, it is processed by AWK to
+limit output to the desired pages or categories, one entry per line.
 
 Options:
   -h         Show this help.
   -c         List pages belonging to category.
   -p         List pages belonging to prefix. This is the default.
-  -s LINK    Set main web site for pages without full link. Default is "$MAIN_SITE".
+  -s URI     Change web site. Default is "$MAIN_SITE".
   -v         Print version.
 
 Examples:
             echo "$NAME $VERSION"
             echo "Copyright © $DATE $AUTHOR"
             exit ;;
-        c) PARAM="$CAT_PARAM" ;;
-        p) PARAM="$PREFIX_PARAM" ;;
-        s) MAIN_SITE="$OPTARG";;
+        c)
+            PARAM="$CAT_PARAM"
+            FILTER="$CAT_FILTER";;
+        p)
+            PARAM="$PREFIX_PARAM"
+            FILTER="$PREFIX_FILTER";;
+        s) 
+            MAIN_SITE="$OPTARG";;
         ?)
             printhelp "$0"
             exit ;;
 echo "Page is $PAGE."
 echo
 
-wget  -q -O - "$PAGE" | gawk 'BEGIN {
-    RS="<|>" 
-    IN_PAGES=0
-    DONE=0
-}
-
-IN_PAGES && /title=/ {
-    gsub(/.*title="/,"")
-    gsub(/".*/,"")
-    print
-}
-
-IN_PAGES && $0 !~ "/div" && /\<div\>/ { IN_PAGES++ }
-/div id="mw-pages"/                   { IN_PAGES=1; DONE=1 }
-IN_PAGES && /\/div/                   { IN_PAGES-- }
-DONE && IN_PAGES==0 && /\/div/        { exit }'
-
-
-
-# 'BEGIN { RS="title=\"" ; FS="\""} /class="printfooter"/ {exit} NR>1 {print $1}'
-# 'BEGIN { RS="title=\"" ; FS="\""} /Books or Pages/,/printfooter/ && NR>1 {print $1}'
-# 'BEGIN { RS="title=\"" ; FS="\""} /are in this category/,/printfooter/ && NR>1 {print $1}'
-# 'BEGIN { RS="title=\"" ; FS="\""; var=0} /Saved in parser cache/,/printfooter/  {print $1}'
-
-# 'BEGIN { RS="title=\"" ; FS="\""; skip=0} /Saved in parser cache/,/printfooter/ { if(skip++) print $1}'
-
-
+## Go!
+wget  -q -O - "$PAGE" | gawk "$FILTER"
 
-# Markup
-# <div id=mw-subcategory>
-# <div id="mw-pages">
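
The category filter in this commit tracks nesting depth: it starts counting at the `mw-pages` div, prints every `title` attribute while inside it, and exits once the matching closing tag is reached. The following is a minimal sketch of that idea, not the script's own code. It assumes a hypothetical `SAMPLE` snippet in place of a fetched page and a hypothetical helper `list_titles`, and it splits records on a single `<` so that it should run on any POSIX awk rather than relying on gawk's regex record separator and `\<...\>` word boundaries.

```shell
#!/bin/sh
# Hypothetical sample standing in for the HTML that wget would fetch.
SAMPLE='<div id="content"><a href="/wiki/Main" title="Main Page">Main</a>
<div id="mw-pages"><div class="inner"><a href="/wiki/Foo" title="Foo">Foo</a></div>
<a href="/wiki/Bar" title="Bar/Intro">Bar</a></div>
<a href="/wiki/Help" title="Help">Help</a></div>'

# list_titles (hypothetical name): print the title attributes found
# inside <div id="mw-pages">, one per line.
# Records are split on "<", so each record begins with a tag name.
list_titles() {
    awk 'BEGIN { RS = "<"; depth = 0 }

    # Inside the target div: extract the value of title="...".
    depth && /title="/ {
        t = $0
        sub(/.*title="/, "", t)
        sub(/".*/, "", t)
        print t
    }

    # Entering the target div starts the depth counter.
    /^div id="mw-pages"/ { depth = 1; next }

    # Nested divs raise the depth; closing divs lower it.
    depth && /^div[ >]/  { depth++ }
    depth && /^\/div>/   { if (--depth == 0) exit }'
}

# Prints "Foo" and "Bar/Intro", one per line; "Main Page" and "Help"
# lie outside the mw-pages div and are skipped.
printf '%s\n' "$SAMPLE" | list_titles
```

The prefix filter follows the same pattern with `table` in place of `div` and `mw-prefixindex-list-table` as the starting id.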