Notes on regex sed, bash DOM html, meta tag

Using bash tools (sed, awk, grep, vim ...) to edit multiple files. Case study: the <!DOCTYPE>, <html> and <meta> (charset, ...) tags.
Sed cheatsheet

HTML meta tags (recommended way?): first the old XHTML-style head, then the HTML5 form I want to end up with.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "">
<html xmlns=" " xml:lang="ja" lang="ja">
<meta http-equiv="Content-Type" content="text/html" charset="EUC-JP" />
<meta http-equiv="Content-Script-Type" content="text/javascript" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="description" content="content here" />


<!DOCTYPE html>
<html lang="ja">
<meta charset="EUC-JP">
<meta name="description" content="content here">

1. <!DOCTYPE html ...>: my idea is to delete everything on the line after the match "DOCTYPE html" and append the closing ">".
$  sed -i 's/DOCTYPE html.*/DOCTYPE html>/' filename

For filename, I use a simple way: find . -name "*.html" | xargs grep -li "DOCTYPE" gives the list of HTML files that contain DOCTYPE and need editing.
Then I use column editing (Vim or Sublime Text) to add the prefix sed ... to each file in the list.
I think a "one-line" command (pipes, chaining, etc.) could do this too, but it seemed to take more time to work out.
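A possible one-line version, as a sketch only: it assumes GNU grep, xargs and sed (for -Z, -0/-r and -i), and the demo directory and file names are made up for illustration. Try it on a copy of the files first.

```shell
# Demo directory with two throwaway files (hypothetical names).
demo=$(mktemp -d)
printf '%s\n' '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "">' \
  '<html lang="ja">' > "$demo/a.html"
printf '%s\n' '<p>no doctype here</p>' > "$demo/b.html"

# List the files that contain the keyword (NUL-separated, so odd file names
# survive), then run the same sed substitution on only those files.
grep -rlZ --include='*.html' 'DOCTYPE html' "$demo" \
  | xargs -0 -r sed -i 's/DOCTYPE html.*/DOCTYPE html>/'

doctype_line=$(head -n 1 "$demo/a.html")
echo "$doctype_line"
```

The grep step replaces the hand-built file list, so no column editing is needed; files without a DOCTYPE are never opened by sed.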

Special case:

-<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"

=> +<!DOCTYPE html>
Because the declaration spans two lines, the sed regex above does not handle it (sed works line by line). Fortunately there are only a few of them, so I fixed them manually. Like many other automated tasks, there are always exceptions, so be careful to double-check the input / output...
And sed, awk and regex end up being used many times. I also mentioned them in some previous posts (should be linked here).
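For the record, the two-line case can probably be automated too. The sketch below assumes GNU sed and a declaration that spans exactly two lines; the second line of the sample file is a made-up placeholder, not the real DTD URL.

```shell
demo=$(mktemp -d)
# The second line is a placeholder; the real files carry the DTD URL there.
printf '%s\n' \
  '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"' \
  '"placeholder.dtd">' \
  '<html lang="ja">' > "$demo/two-line.html"

# When a line opens a DOCTYPE but has no closing ">", append the next line (N)
# to the pattern space and replace the joined declaration in one substitution.
sed -i '/<!DOCTYPE html[^>]*$/{N;s/<!DOCTYPE html[^>]*>/<!DOCTYPE html>/;}' \
  "$demo/two-line.html"

two_line_result=$(cat "$demo/two-line.html")
echo "$two_line_result"
```

A declaration that already sits on one line does not match the address, so it is left for the simpler command.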

2. <html xmlns=" " xml:lang="ja" lang="ja"> =>
<html lang="ja">

In this case, similar to the previous one, I use sed to remove the front of the line (everything before lang=...). It seems there is no good/working answer for this on the Internet, so I work around it:
- Append the keyword 'delete-me' (it should be unique) to the first part of the line, before the match. We don't need this part and can remove it later.
- Start a new line, prepend <html to the second part (the matched keyword) and keep the tail of the line.

$ sed -i 's/ lang=/delete-me\n<html&/g'  filename.html

& here stands for the matched text itself, i.e. the keyword ' lang=' (not the tail of the line; the tail is simply left untouched after the match).
Then run sed again to remove the temporary dump lines:
$ sed -i '/delete-me/d'  filename.html
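The same result can likely be had with a single substitution that matches from <html up to the last " lang=" inside the tag. This is only a sketch, assuming GNU sed and that the whole <html ...> tag sits on one line; the demo file name is made up.

```shell
demo=$(mktemp -d)
printf '%s\n' '<html xmlns=" " xml:lang="ja" lang="ja">' > "$demo/page.html"

# [^>]* keeps the match inside the tag; the leading space in " lang=" skips
# xml:lang=, so only the plain lang attribute is kept.
sed -i 's/<html[^>]* lang=/<html lang=/' "$demo/page.html"

html_result=$(cat "$demo/page.html")
echo "$html_result"
```

Because the pattern is anchored on <html, other tags that happen to carry a lang attribute are not touched, and no temporary 'delete-me' line is needed.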

3. Removing the CSS and JS meta tags is straightforward
$ sed -i '/meta http-equiv="Content-Style/d' filename

Replace Content-Style with Content-Script to remove the JS meta tag.
Note that the keyword includes 'meta http-equiv'. With only "Content-Style" (or "Content-Script") as the keyword, we could also delete unrelated lines, a <script> tag for example...
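Both tags can also go in one pass with an extended-regex alternation. A small sketch, assuming GNU sed (-E with -i) and a throwaway demo file:

```shell
demo=$(mktemp -d)
printf '%s\n' \
  '<meta http-equiv="Content-Script-Type" content="text/javascript" />' \
  '<meta http-equiv="Content-Style-Type" content="text/css" />' \
  '<meta name="description" content="content here" />' > "$demo/page.html"

# Delete any line holding either of the two http-equiv meta tags.
sed -E -i '/<meta http-equiv="Content-(Script|Style)-Type"/d' "$demo/page.html"

meta_result=$(cat "$demo/page.html")
echo "$meta_result"
```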

4. <meta http-equiv="Content-Type" content="text/html" charset="EUC-JP" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />  (most common case).

=> <meta charset="EUC-JP">

This case seems similar to the <html lang=...> conversion above.

$  sed -i 's/; charset=UTF/delete-me&\n<meta charset=\"UTF/g' filename

Because the keyword "charset=" can also appear in <script> or other tags:
<script ... charset="..."> we have to include '; ' in the keyword.

The most frequent case is content="...; charset=...".
I think I made a small mistake here: UTF is not needed in the keyword. It got there because the command I first copied from another tutorial had '&' at the end of the replacement ('&/g'). '&' stands for the matched keyword, so where it is placed decides whether the new text goes before or after the keyword.
So I will try to run it without UTF later. (But I think I am too lazy to try again :)

Anyway, it works.

We need another sed update, this time for the form where charset is an attribute of its own instead of part of content="...; charset=...":
$ sed -i 's/" charset="/delete-me&\n<meta charset="/g'  file2edit.html

Then run sed again to remove all the dump lines containing 'delete-me'.

At first, I thought one sed would be enough for this case, applied to the keyword '; charset=' or simply ' charset=' if we ignore <script> tags, since they appear only a few times and can be edited manually.
But the real reason the second sed run is required is this case:
<meta http-equiv="Content-Type" content="text/html" charset="EUC-JP" />

As you can see, there is no content="...; charset=..." here; note the missing '; '. We still have to remove the whole front part of the line before the keyword, so a second sed run is needed.
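Both charset forms could probably be handled in one run by capturing the charset name and rebuilding the tag. This is a sketch, not what I actually ran: it assumes GNU sed, that the meta tag sits on one line, and that http-equiv="Content-Type" is spelled exactly like this; the demo file names are made up.

```shell
demo=$(mktemp -d)
printf '%s\n' \
  '<meta http-equiv="Content-Type" content="text/html" charset="EUC-JP" />' \
  > "$demo/attr.html"
printf '%s\n' \
  '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />' \
  > "$demo/content.html"

# Only touch Content-Type meta lines; capture the charset name whether or not
# a quote comes right after "charset=", then rebuild the tag around it.
sed -E -i '/<meta http-equiv="Content-Type"/s/<meta .*charset="?([A-Za-z0-9_-]+)".*/<meta charset="\1">/' \
  "$demo/attr.html" "$demo/content.html"

attr_result=$(cat "$demo/attr.html")
content_result=$(cat "$demo/content.html")
echo "$attr_result"
echo "$content_result"
```

Restricting the substitution to Content-Type meta lines leaves <script ... charset="..."> tags alone, so the '; ' trick and the 'delete-me' cleanup are not needed here.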

Sed change multiple files

Diff two text files and keep only the lines that differ. The idea is to sort the two files first and then take the difference.
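One way to sketch the sort-then-compare idea is with sort and comm. The file names and contents below are invented for the demo.

```shell
demo=$(mktemp -d)
printf '%s\n' banana apple cherry > "$demo/old.txt"
printf '%s\n' cherry apple durian > "$demo/new.txt"

# comm needs sorted input.
sort "$demo/old.txt" > "$demo/old.sorted"
sort "$demo/new.txt" > "$demo/new.sorted"

# -3 drops the lines common to both files; lines found only in the second
# file are printed with a leading tab.
diff_lines=$(comm -3 "$demo/old.sorted" "$demo/new.sorted")
echo "$diff_lines"
```

Here banana shows up unindented (only in old.txt) and durian shows up tab-indented (only in new.txt).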

