Monday, January 14, 2013

Download an entire blog from blogger.com

*Updated table with some improvements*

I have been reading a site called Queryshark as part of the process to refine a query letter for courting literary agents. Most of the time that I have to read is on the bus, and while I do now have internet access, the connection is spotty along the route, and I am somewhat bandwidth constrained. As such, I opted to download the site to read it offline.

This is not as easy as I thought it would be.

Originally I tried a program called BackStreet Browser, but try as I might, I never got the settings to work so that it would download the entire archive. Out of 200+ entries, it only ever grabbed about 15 or so.

And I thought to myself, “there’s a better way, I just know it.”

So, here is a handy dandy guide if you ever wish to transform a blogger site into something that is easy to read offline. I wasn’t going to post this, as it required some commercial software, but Microsoft just made Expression Web 4 free for all. If you want a free web page editor, you aren’t going to do better.

Disclaimer: These instructions assume that there is an RSS feed set up for the site, and that you’re using Windows.

1. Download and use Blogger Backup. The settings are fairly simple. One thing I had trouble with is downloading with comments to a single file. If you want the comments, I recommend you have one file per post. This creates a file for every single comment, but we can combine them later. Choose a date where you want to start, or get everything, and click the go button.

2. Go do something else for a while.

3. Once done, you’ll have a folder filled with xml files. The backup utility is designed for pulling all of one’s data from blogger for migrating it to a different blogging platform. As such, the files are not especially readable. There’s also a lot of them, but consolidating them is surprisingly easy.

4. Launch the command prompt and navigate to the folder where the files are kept. Now type in the following command:

copy *.xml consolidated.txt

This will combine all of the files, in chronological order from oldest to newest, into a single file. The comments will be reversed for a given post, though, so that the most recent comment for a given post will be at the top.
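If you would rather script this step than use the command prompt, here is a minimal Python sketch of the same idea. It assumes the backup files are UTF-8 and that sorting them by file name gives the order you want, which may not match what copy does on your machine. The function name consolidate and its parameters are made up for illustration.

```python
from pathlib import Path


def consolidate(folder, output_name="consolidated.txt"):
    """Join every .xml file in `folder` into one text file and return the file count."""
    folder = Path(folder)
    # Assumption: sorting by file name puts the entries in a sensible order.
    sources = sorted(folder.glob("*.xml"))
    output = folder / output_name
    with output.open("w", encoding="utf-8") as out:
        for source in sources:
            # errors="replace" keeps going if a file has a stray bad byte.
            out.write(source.read_text(encoding="utf-8", errors="replace"))
            out.write("\n")
    return len(sources)
```

Calling consolidate on the backup folder should leave a consolidated.txt next to the xml files, ready for step 6.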

5. Download and install Notepad++.

6. Open the text file. At the top of the file, add

<html>

<body>

At the bottom of the file, add

</body>

</html>

File –> Save As, under options choose “hypertext markup language”

Save as consolidated.html
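The same wrapping can be scripted if the file is too big to edit comfortably by hand. This is only a sketch: wrap_in_html and save_as_html are hypothetical helper names, and it assumes the text file is UTF-8.

```python
from pathlib import Path


def wrap_in_html(text):
    """Put the opening and closing html/body tags around the text."""
    return "<html>\n<body>\n" + text + "\n</body>\n</html>\n"


def save_as_html(source, destination):
    """Read a text file, wrap it, and write the result as an .html file."""
    text = Path(source).read_text(encoding="utf-8", errors="replace")
    Path(destination).write_text(wrap_in_html(text), encoding="utf-8")
```

For example, save_as_html("consolidated.txt", "consolidated.html") would do the whole of this step in one go.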

7. Here’s where things get a bit hairy. Under Search –> Replace (CTRL+H)

Find                  Replace
<id>                  <!--<id>
</id>                 </id>-->
<email>               <!--<email>
</email>              </email>-->
<updated>             <!--<updated>
</updated>            </updated>-->
<uri>                 <!--<uri>
</uri>                </uri>-->
<published>           <!--<published>
</published>          </published>-->
<title type="text">   <h2>
</title>              </h2>

This is commenting out a bunch of xml metadata that isn’t relevant to reading the posts and comments. These were determined by trial and error, so there may be more or less depending on the nature of the blog and when this is performed.
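Doing a dozen find-and-replace passes by hand gets old, so here is a rough Python sketch that applies the same pairs in one pass. The list mirrors the table above, with the title rows written as literal h2 tags. REPLACEMENTS and clean_metadata are names invented for this example, and as with the manual version you may need to add or drop pairs for your blog.

```python
# Each pair is (text to find, text to put in its place).
REPLACEMENTS = [
    ("<id>", "<!--<id>"),
    ("</id>", "</id>-->"),
    ("<email>", "<!--<email>"),
    ("</email>", "</email>-->"),
    ("<updated>", "<!--<updated>"),
    ("</updated>", "</updated>-->"),
    ("<uri>", "<!--<uri>"),
    ("</uri>", "</uri>-->"),
    ("<published>", "<!--<published>"),
    ("</published>", "</published>-->"),
    ('<title type="text">', "<h2>"),
    ("</title>", "</h2>"),
]


def clean_metadata(text):
    """Comment out the metadata tags and turn titles into h2 headings."""
    for find, replace in REPLACEMENTS:
        text = text.replace(find, replace)
    return text
```

You would read consolidated.html into a string, pass it through clean_metadata, and write the result back out.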

Save and close the file.

8. Open a web browser, preferably IE or Firefox (I had trouble getting this all to work in Chrome), and open the file consolidated.html. If you aren’t sure how to open a local file, hit CTRL+O (the letter, not the number).

9. Go do something else. This can actually take a bit depending on the size of the file. The browser interprets the XML and translates it into HTML code. It’s not going to look right, but that’s okay for now.

10. Once it’s done loading the site, choose File –> Save As. Under options, choose text file (.txt). Call it consolidated2.txt.

11. Open consolidated2.txt in Notepad++.

Redo the html code listed above:

At the top of the file, add

<html>

<body>

At the bottom of the file, add

</body>

</html>

File –> Save As, under options choose “hypertext markup language”

Save as consolidated2.html and close.

12. Now go to the browser and open consolidated2.html.

You should now have something that is mostly readable, or at least you can tease the content out of the remaining cruft.

13. Here is where Expression Web comes in. Open the file consolidated2.html.

14. Split consolidated2.html into smaller pages with fewer entries. It was worth consolidating to do the bulk of the work all at once, but just 200 entries with comments is enough code to bring just about any browser to its knees. After a while, it gives up trying to parse the error-prone html and the stuff at the bottom just looks weird. 25 entries seems to strike a good balance of readability and browser speed.
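I did the splitting by hand in the editor, but it could be sketched in Python along these lines. The big assumption here is that every entry starts with an h2 heading, which is only true once the headers have been cleaned up. split_into_pages and per_page are names made up for the example.

```python
import re


def split_into_pages(html, per_page=25):
    """Split one long page into several pages of at most `per_page` entries."""
    # Drop the outer wrapper so each new page can get its own.
    body = re.sub(r"</?(?:html|body)>", "", html, flags=re.IGNORECASE)
    # Assumption: every entry begins at an <h2> heading.
    starts = [m.start() for m in re.finditer(r"<h2\b", body, flags=re.IGNORECASE)]
    ends = starts[1:] + [len(body)]
    entries = [body[a:b] for a, b in zip(starts, ends)]
    pages = []
    for first in range(0, len(entries), per_page):
        chunk = "".join(entries[first:first + per_page])
        pages.append("<html>\n<body>\n" + chunk + "\n</body>\n</html>\n")
    return pages
```

Each string in the returned list can then be saved as its own file. Anything before the first heading is dropped, so check that nothing you care about lives there.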

15. The nice thing about using a web editor like Expression is that you can have the page and the underlying code open side by side. It will also call out html code that was opened but never closed. In my experience, a lot of things that get italicized with the <i> tag never get a closing </i> tag. You can find those quickly and easily using this tool.
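If you don't have Expression handy, the hunt for unclosed tags can be roughly approximated with Python's built-in html.parser module. This is a sketch, not a replacement for the editor: UnclosedTagFinder and find_unclosed are hypothetical names, and it only reports the tags you ask about (just i by default).

```python
from html.parser import HTMLParser

# Tags that never take a closing tag, so they should not be tracked.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class UnclosedTagFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag name, line number where it opened)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        # Close the most recent opener with the same name, if there is one.
        for index in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[index][0] == tag:
                del self.open_tags[index]
                break


def find_unclosed(html, tags=("i",)):
    """Return (tag, line) pairs for the listed tags that were opened but never closed."""
    finder = UnclosedTagFinder()
    finder.feed(html)
    finder.close()
    return [(tag, line) for tag, line in finder.open_tags if tag in tags]
```

Feeding it the contents of one of the split pages gives a list of line numbers to visit in the editor.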

16. The one thing I never figured out is how to detect comments vs. actual posts, so they are treated equally. When cleaning up the document for easier reading, I opted to give actual posts Header 2 <h2> and comments Header 3 <h3>.

17. That’s about it, but some handy dandy tricks for using the tool:

CTRL + (down arrow) jumps to the next header. If you have a lot of text and don’t want to scroll while editing, that gets you there a lot quicker.

If you highlight something and hit CTRL+SHIFT+S, you can change the header immediately.

You’ll get random characters at the end of some comment titles that happen to truncate at an apostrophe. I have no idea why.

Starting at the bottom of a page, near the top of the window you can see all of the open html codes that need to be closed. You can click on them to jump straight to that code wherever it is in the page. You can right click and choose remove directly. NEVER click on it and hit the delete key! The act of clicking on the code highlights everything on the page between where you are and where that piece of code is, and hitting delete gets rid of all of that content.

CTRL+Z is your friend :)
